vector compare not equal and compare not not equal should be handled better #78888
IR:

```llvm
define i64 @foo(i8 zeroext %0) {
Entry:
  %1 = insertelement <1 x i8> poison, i8 %0, i64 0
  %2 = shufflevector <1 x i8> %1, <1 x i8> poison, <8 x i32> zeroinitializer
  %3 = and <8 x i8> %2, <i8 -128, i8 64, i8 32, i8 16, i8 8, i8 4, i8 2, i8 1>
  %.not = icmp ne <8 x i8> %3, zeroinitializer
  %4 = sext <8 x i1> %.not to <8 x i8>
  %5 = bitcast <8 x i8> %4 to i64
  ret i64 %5
}
```
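(This appears to be the optimized IR for the `foo` reproducer below: the `icmp ne` + `sext` pair is the vector compare-not-equal the issue title refers to.)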
@llvm/issue-subscribers-backend-x86

Author: Niles Salter (Validark)
So I wrote some code that takes 8 bits and expands each bit into 8 bits (all 1's or all 0's) like so:

Now, I did write a version that works via `pdep(m, 0x0101010101010101) * 255` and also a SWAR implementation. But I also wanted to try a version that works via vectors. Here is what I wrote for that:

```zig
fn foo(a: u8) u64 {
    const x = a;
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(x));
    return @as(u64, @bitCast(@select(
        u8,
        (splatted & unique_bytes) == unique_bytes,
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
}
```

For those who can't read Zig that well, here is a diagram depicting what I am doing. I broadcast 8 bits into a vector of 8 bytes, in this case represented as `abcdefgh`, and isolate a single bit in each byte. Then, for each byte where the isolated bit is present, I want that byte to become all ones, otherwise all zeros. To do that, I do a lane-wise equality comparison against the same vector I used to isolate each bit. In Zig, that gives me a vector of booleans, so I do a `@select`, which should give me all ones for each true lane and all zeros for each false lane. That maps to what the hardware already does anyway, so it should be a no-op.
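The SWAR implementation mentioned above isn't included in the issue. As a rough editorial sketch of what that approach could look like (`spreadBitsSwar` is a hypothetical name, and the lane order is assumed to match `unique_bytes` on a little-endian target):

```zig
const std = @import("std");

fn spreadBitsSwar(m: u8) u64 {
    const broadcast = @as(u64, m) * 0x0101010101010101; // copy m into all 8 bytes
    const isolated = broadcast & 0x0102040810204080; // keep one distinct bit per byte
    // Each nonzero byte holds a single bit, so adding 0x7F carries into
    // bit 7 of that byte; masking leaves 0x80 for set lanes, 0x00 otherwise.
    const high = (isolated + 0x7F7F7F7F7F7F7F7F) & 0x8080808080808080;
    return (high >> 7) * 0xFF; // turn each 0x80 into 0xFF
}

test "spreadBitsSwar expands each bit into a full byte" {
    try std.testing.expectEqual(@as(u64, 0x00000000000000FF), spreadBitsSwar(0x80));
    try std.testing.expectEqual(@as(u64, 0xFF00000000000000), spreadBitsSwar(0x01));
    try std.testing.expectEqual(@as(u64, 0xFFFFFFFFFFFFFFFF), spreadBitsSwar(0xFF));
}
```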
Currently, the vector version maps to this assembly for x86-64 Zen 3:

```asm
.LCPI0_0:
.byte 128
.byte 64
.byte 32
.byte 16
.byte 8
.byte 4
.byte 2
.byte 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
.zero 1
foo:
vmovd xmm0, edi
vpxor xmm1, xmm1, xmm1
vpbroadcastb xmm0, xmm0
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpcmpeqb xmm0, xmm0, xmm1
vpcmpeqd xmm1, xmm1, xmm1
vpxor xmm0, xmm0, xmm1
vmovq rax, xmm0
```

You can see the compiler wanted to optimize away using that `unique_bytes` vector twice, and changed `(splatted & unique_bytes) == unique_bytes` to `(splatted & unique_bytes) != @splat(0)` (legal because each `unique_bytes` lane has exactly one bit set). However, there is no `vpcmpneqb`/vector-not-equal on Zen 3 hardware. (We do have it with AVX512-enabled ISAs, but we avoid it because it's slower for the hardware.) To emulate it, the generated instructions instead compute `(splatted & unique_bytes) == @splat(0)` and then xor the result with all ones to flip the bits (`vpcmpeqd` + `vpxor` in the assembly).
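Spelled out in Zig, the canonicalized form the optimizer chose looks roughly like this (editorial sketch, not code from the issue; `fooCanonical` is a hypothetical name):

```zig
// Every unique_bytes lane has exactly one bit set, so checking
// `== unique_bytes` after the mask selects the same lanes as `!= 0`.
fn fooCanonical(a: u8) u64 {
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(a));
    return @as(u64, @bitCast(@select(
        u8,
        (splatted & unique_bytes) != @as(@Vector(8, u8), @splat(0)),
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
}
```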
The generated assembly above does not seem optimal to me. I think we could instead read `.LCPI0_0` into `xmm1` and eliminate the need to re-read it in the `vpand` instruction, or we could give `vpcmpeqb` the `xmmword ptr [rip + .LCPI0_0]` operand too, if the hardware likes that and if `vpcmpeqb` can take operands in memory. Alternatively, we could do a `not edi` at the beginning or a `not rax` at the end. These last two ideas I tried in my Zig code. When you change `const x = a;` in the above function to `const x = ~a;`, it gives this assembly:

```asm
foo:
not dil
vpxor xmm1, xmm1, xmm1
vmovd xmm0, edi
vpbroadcastb xmm0, xmm0
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpcmpeqb xmm0, xmm0, xmm1
vpcmpeqd xmm1, xmm1, xmm1
vpxor xmm0, xmm0, xmm1
vmovq rax, xmm0
```

Obviously, this is a double negative: the scalar `not dil` and the vector not (`vpcmpeqd` + `vpxor`) cancel each other, since `(~a & m) != 0` is just `(a & m) == 0` for these single-bit masks. I don't know what the internal representation looks like in the LLVM passes, so maybe it's difficult for the compiler to see this, but eliminating both should give the optimal emit below.
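As a sanity check of that cancellation, the identity can be brute-forced over all 256 inputs (editorial test sketch, reusing the `foo` defined above):

```zig
const std = @import("std");

test "scalar not and vector not cancel" {
    var a: u8 = 0;
    while (true) : (a +%= 1) {
        // Flipping the input flips every selected lane, so foo(~a) == ~foo(a).
        try std.testing.expectEqual(~foo(a), foo(~a));
        if (a == 255) break;
    }
}
```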
When you make any of the following changes (invert the bitstring at the end, invert the condition, or swap 255 with 0), you do get optimal emit:

```diff
- return @as(u64, @bitCast(@select(
+ return ~@as(u64, @bitCast(@select(
u8,
- (splatted & unique_bytes) == unique_bytes,
+ (splatted & unique_bytes) != unique_bytes,
- @as(@Vector(8, u8), @splat(255)),
+ @as(@Vector(8, u8), @splat(0)),
- @as(@Vector(8, u8), @splat(0)),
+ @as(@Vector(8, u8), @splat(255)),
)));
```

optimal emit:

```asm
foo:
vmovd xmm0, edi
vpxor xmm1, xmm1, xmm1
vpbroadcastb xmm0, xmm0
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpcmpeqb xmm0, xmm0, xmm1
vmovq rax, xmm0
```

However, this will give me the inverse of what I want, so there would still need to be a `not` at the beginning or end. We also should be able to fold that `not` into whatever comes next. E.g., imagine a scenario like:

```diff
-export fn foo(a: u8) u64 {
+export fn foo(a: u8, b: u64) u64 {
const x = a;
const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
const splatted = @as(@Vector(8, u8), @splat(x));
- return @as(u64, @bitCast(@select(
+ return b & @as(u64, @bitCast(@select(
u8,
(splatted & unique_bytes) == unique_bytes,
@as(@Vector(8, u8), @splat(255)),
@as(@Vector(8, u8), @splat(0)),
)));
}
```

In this case, we could do an `andn` with `b` at the end instead of inverting the vector or the `rax` register, or the other ideas I already mentioned. I'm not here to tell you which of those options are best for the hardware, but I have given a bunch of options to eliminate instructions.

Godbolt playground
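For reference, BMI1's `andn` computes `~x & y` in a single instruction, which is what makes this fold attractive. A scalar illustration (editorial sketch; `andnFold` is a hypothetical name):

```zig
fn andnFold(x: u64, y: u64) u64 {
    // With BMI1 enabled (Zen 3 has it), LLVM selects a single `andn`
    // for this pattern, so the vector-not left over from the compare
    // can be absorbed by the masking consumer for free.
    return ~x & y;
}
```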
In my real code, I actually use the return value and its inverse:

```zig
fn bar(x: u8) @Vector(8, u8) {
    const byte_indices: u64 = 0x0706050403020100;
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(x));
    const vec = @as(u64, @bitCast(@select(
        u8,
        (splatted & unique_bytes) == unique_bytes,
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
    return @bitCast(pdep(byte_indices, vec) | pdep(byte_indices, ~vec));
}
```
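`bar` calls a `pdep` helper that the issue doesn't include. A wrapper along the following lines is presumably what is meant (editorial sketch; the `pdep(src, mask)` argument order is inferred from the call sites and the `pdep rdx, rcx, rax` in the assembly below, not taken from the author's code):

```zig
// BMI2 parallel-deposit: scatter the low bits of `src` into the bit
// positions selected by `mask`. AT&T operand order is mask, src, dst.
inline fn pdep(src: u64, mask: u64) u64 {
    return asm ("pdep %[mask], %[src], %[ret]"
        : [ret] "=r" (-> u64),
        : [src] "r" (src),
          [mask] "r" (mask),
    );
}
```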
Currently, LLVM produces double-negative code for this type of usage:

```asm
bar:
vmovd xmm0, edi
vpxor xmm1, xmm1, xmm1
movabs rcx, 506097522914230528
vpbroadcastb xmm0, xmm0
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpcmpeqb xmm0, xmm0, xmm1
vpcmpeqd xmm1, xmm1, xmm1
vpxor xmm0, xmm0, xmm1
vmovq rax, xmm0
pdep rdx, rcx, rax
not rax
pdep rax, rcx, rax
or rax, rdx
vmovq xmm0, rax
```

If I invert the check for `== unique_bytes` to `!= unique_bytes`, it gives me the optimal emit (which does exactly what I want):

```asm
bar:
vmovd xmm0, edi
vpxor xmm1, xmm1, xmm1
movabs rcx, 506097522914230528
vpbroadcastb xmm0, xmm0
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpcmpeqb xmm0, xmm0, xmm1
vmovq rax, xmm0
pdep rdx, rcx, rax
not rax
pdep rax, rcx, rax
or rax, rdx
vmovq xmm0, rax
ret
```

Godbolt link

I am using inline assembly for the `pdep` operation, since Zig currently does not have a way of accessing it in the LLVM-blessed way. Not sure if that throws a wrench into the optimization, but I think it should not affect it, because all that's relevant is that we are consuming both `vec` and `~vec`.

Thank you to all LLVM contributors!

‒ Validark
The commit 72f10f7 was reverted in b9483d3, so I guess this should be reopened. It would of course have been nice if GitHub had picked up that the fix was reverted, but... The reverting commit mentions a miscompile.

EDIT:
Thanks @mikaelholmen. I'm on PTO, so I asked @dyung to revert until I could take a proper look.