
vector compare not equal and compare not not equal should be handled better #78888


Closed · Validark opened this issue Jan 21, 2024 · 4 comments

Validark commented Jan 21, 2024

So I wrote some code that takes 8 bits and expands each bit into 8 bits (all 1s or all 0s), like so:

┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│    0   │    0   │    1   │    0   │    0   │    1   │    1   │    0   │
├──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┤
│00000000│00000000│11111111│00000000│00000000│11111111│11111111│00000000│
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘

Now, I did write a version that works via pdep(m, 0x0101010101010101) * 255 and also a SWAR implementation. But I also wanted to try a version that works via vectors. Here is what I wrote for that:

fn foo(a: u8) u64 {
    const x = a;
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(x));
    return @as(u64, @bitCast(@select(
        u8,
        (splatted & unique_bytes) == unique_bytes,
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
}

For those who can't read Zig that well, here is a diagram depicting what I am doing. I broadcast 8 bits across a vector of 8 bytes, in this case represented as abcdefgh, and isolate a single bit in each byte. Then, for each byte where the isolated bit is present, I want that byte to turn into all ones, and otherwise all zeros. To do that, I do a per-lane equality comparison against the vector I just used to isolate each bit. In Zig, that gives me a vector of booleans, so I do a @select, which should give me all ones for a true and all zeros for a false. That should map to what the hardware comparison already produces anyway, so it should be a no-op.

┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│abcdefgh│
├───&────┼───&────┼───&────┼───&────┼───&────┼───&────┼───&────┼───&────┤
│10000000│01000000│00100000│00010000│00001000│00000100│00000010│00000001│
├──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┤
│a0000000│0b000000│00c00000│000d0000│0000e000│00000f00│000000g0│0000000h│
├───≡────┼───≡────┼───≡────┼───≡────┼───≡────┼───≡────┼───≡────┼───≡────┤
│10000000│01000000│00100000│00010000│00001000│00000100│00000010│00000001│
├──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┼──↓↓↓───┤
│aaaaaaaa│bbbbbbbb│cccccccc│dddddddd│eeeeeeee│ffffffff│gggggggg│hhhhhhhh│
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
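
(As an aside, the SWAR variant mentioned above could look roughly like the sketch below. This is an illustrative reconstruction rather than my exact code; the byte order assumes a little-endian bitcast, matching the vector version.)

fn fooSwar(a: u8) u64 {
    // Copy `a` into every byte, then keep a different bit of `a` in each byte:
    // 0x80 in the lowest byte down to 0x01 in the highest, matching unique_bytes.
    const replicated = @as(u64, a) *% 0x0101010101010101;
    const isolated = replicated & 0x0102040810204080;
    // Each byte is now 0 or a single power of two, so adding 0x7F to every byte
    // sets that byte's high bit exactly when the byte was nonzero (no carries).
    const high_bits = (isolated +% 0x7F7F7F7F7F7F7F7F) & 0x8080808080808080;
    // Spread each high bit across its byte: 0x80 becomes 0xFF, 0x00 stays 0x00.
    return (high_bits >> 7) *% 0xFF;
}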

Currently, the vector version maps to this assembly for x86-64 Zen 3:

.LCPI0_0:
        .byte   128
        .byte   64
        .byte   32
        .byte   16
        .byte   8
        .byte   4
        .byte   2
        .byte   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
        .zero   1
foo:
        vmovd   xmm0, edi
        vpxor   xmm1, xmm1, xmm1
        vpbroadcastb    xmm0, xmm0
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpcmpeqb        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vpxor   xmm0, xmm0, xmm1
        vmovq   rax, xmm0

You can see the compiler wanted to avoid using that unique_bytes vector twice, and changed (splatted & unique_bytes) == unique_bytes into (splatted & unique_bytes) != @splat(0). However, there is no vpcmpneqb/vector-not-equal instruction on Zen 3 hardware. (We do have it on AVX-512-enabled ISAs, but we avoid it because it's slower for the hardware.) To emulate it, the emitted code instead computes ((splatted & unique_bytes) == @splat(0)) and then xors the result with all ones to flip every bit (the vpcmpeqd + vpxor in the assembly).
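
In source terms, the transform described above effectively rewrites the function like this (a sketch of the optimizer's view, with a hypothetical name; not code I actually wrote):

fn fooTransformed(a: u8) u64 {
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(a));
    // Since each byte of unique_bytes has exactly one bit set,
    // (splatted & unique_bytes) == unique_bytes is the same as
    // (splatted & unique_bytes) != 0; that is the form LLVM chose, which the
    // non-AVX-512 backend then emulates with a compare-equal plus an xor with all ones.
    return @as(u64, @bitCast(@select(
        u8,
        (splatted & unique_bytes) != @as(@Vector(8, u8), @splat(0)),
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
}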

This assembly does not seem optimal to me. I think we could instead load .LCPI0_0 into xmm1 and eliminate the need to read it in the vpand instruction, or we could give vpcmpeqb the xmmword ptr [rip + .LCPI0_0] operand too, if the hardware likes that and if vpcmpeqb can take memory operands. Alternatively, we could do a not edi at the beginning or a not rax at the end. I tried the latter two ideas in my Zig code. When you change const x = a; in the above function to const x = ~a;, it gives this assembly:

foo:
        not     dil
        vpxor   xmm1, xmm1, xmm1
        vmovd   xmm0, edi
        vpbroadcastb    xmm0, xmm0
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpcmpeqb        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vpxor   xmm0, xmm0, xmm1
        vmovq   rax, xmm0

Obviously, this is a double negative, although I don't know what the internal representation looks like in the LLVM passes, so maybe it's difficult for the compiler to see. But ideally this should compile down to the optimal emit shown below.

When you make any of the following changes (invert the bitstring at the end, invert the condition, or swap 255 and 0), you do get the optimal emit:

-    return @as(u64, @bitCast(@select(
+    return ~@as(u64, @bitCast(@select(
        u8,
-       (splatted & unique_bytes) == unique_bytes,
+       (splatted & unique_bytes) != unique_bytes,

-        @as(@Vector(8, u8), @splat(255)),
+       @as(@Vector(8, u8), @splat(0)),

-       @as(@Vector(8, u8), @splat(0)),
+       @as(@Vector(8, u8), @splat(255)),

    )));
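
Spelled out, applying just the inverted-condition change from the diff gives this full function (as noted below, it computes the bitwise inverse of the original foo):

fn foo(a: u8) u64 {
    const x = a;
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(x));
    // Same as before, but with the comparison inverted; this version gets the
    // optimal emit shown below, at the cost of producing the inverted mask.
    return @as(u64, @bitCast(@select(
        u8,
        (splatted & unique_bytes) != unique_bytes,
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
}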

optimal emit:

foo:
        vmovd   xmm0, edi
        vpxor   xmm1, xmm1, xmm1
        vpbroadcastb    xmm0, xmm0
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpcmpeqb        xmm0, xmm0, xmm1
        vmovq   rax, xmm0

However, this gives me the inverse of what I want, so there would still need to be a not at the beginning or the end. We should also be able to fold that not into whatever comes next. E.g., imagine a scenario like:

-export fn foo(a: u8) u64 {
+export fn foo(a: u8, b: u64) u64 {
    const x = a;
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(x));
-   return @as(u64, @bitCast(@select(
+   return b & @as(u64, @bitCast(@select(
        u8,
        (splatted & unique_bytes) == unique_bytes,
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
}

In this case, we could do an andn with b at the end, instead of inverting the vector or the rax register or using the other ideas I already mentioned. I'm not here to tell you which of those options is best for the hardware, but I have given a bunch of options for eliminating instructions.
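
The fold relies on a simple identity; in scalar terms (hypothetical helper name, just to illustrate):

fn maskWithInvertedMask(b: u64, inverted_mask: u64) u64 {
    // If the cheap compare produces the inverted mask, then masking b is just
    // b & ~inverted_mask, which BMI1 hardware can do with a single andn.
    return b & ~inverted_mask;
}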

Godbolt playground

In my real code, I actually use the return value and its inverse.

fn bar(x: u8) @Vector(8, u8) {
    const byte_indices: u64 = 0x0706050403020100;
    const unique_bytes: @Vector(8, u8) = [_]u8{ 0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01 };
    const splatted = @as(@Vector(8, u8), @splat(x));
    const vec = @as(u64, @bitCast(@select(
        u8,
        (splatted & unique_bytes) == unique_bytes,
        @as(@Vector(8, u8), @splat(255)),
        @as(@Vector(8, u8), @splat(0)),
    )));
    return @bitCast(pdep(byte_indices, vec) | pdep(byte_indices, ~vec));
}

Currently, LLVM produces double-negative code for this type of usage.

bar:
        vmovd   xmm0, edi
        vpxor   xmm1, xmm1, xmm1
        movabs  rcx, 506097522914230528
        vpbroadcastb    xmm0, xmm0
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpcmpeqb        xmm0, xmm0, xmm1
        vpcmpeqd        xmm1, xmm1, xmm1
        vpxor   xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        pdep    rdx, rcx, rax
        not     rax
        pdep    rax, rcx, rax
        or      rax, rdx
        vmovq   xmm0, rax

If I invert the check for == unique_bytes to != unique_bytes, it gives me the optimal emit (which does exactly what I want).

bar:
        vmovd   xmm0, edi
        vpxor   xmm1, xmm1, xmm1
        movabs  rcx, 506097522914230528
        vpbroadcastb    xmm0, xmm0
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
        vpcmpeqb        xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        pdep    rdx, rcx, rax
        not     rax
        pdep    rax, rcx, rax
        or      rax, rdx
        vmovq   xmm0, rax
        ret

Godbolt link

I am using inline assembly for the pdep operation, since Zig currently does not have a way of accessing it in the LLVM-blessed way. I'm not sure whether that throws a wrench into the optimization, but I don't think it should, because all that matters is that we are consuming both vec and ~vec.
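
For completeness, a minimal sketch of what such an inline-assembly pdep helper can look like (this assumes AT&T operand order and the pdep(source, mask) argument order used by bar above, and it requires a BMI2 target; it is not necessarily my exact helper):

fn pdep(source: u64, mask: u64) u64 {
    // Deposits the low-order bits of `source` into the bit positions selected
    // by `mask` (the x86 BMI2 PDEP instruction).
    return asm ("pdep %[mask], %[source], %[ret]"
        : [ret] "=r" (-> u64),
        : [source] "r" (source),
          [mask] "r" (mask)
    );
}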

Thank you to all LLVM contributors!

‒ Validark

RKSimon (Collaborator) commented Jan 24, 2024

IR:

define i64 @foo(i8 zeroext %0) {
Entry:
  %1 = insertelement <1 x i8> poison, i8 %0, i64 0
  %2 = shufflevector <1 x i8> %1, <1 x i8> poison, <8 x i32> zeroinitializer
  %3 = and <8 x i8> %2, <i8 -128, i8 64, i8 32, i8 16, i8 8, i8 4, i8 2, i8 1>
  %.not = icmp ne <8 x i8> %3, zeroinitializer
  %4 = sext <8 x i1> %.not to <8 x i8>
  %5 = bitcast <8 x i8> %4 to i64
  ret i64 %5
}

RKSimon self-assigned this Jan 24, 2024

llvmbot (Member) commented Jan 24, 2024

@llvm/issue-subscribers-backend-x86

Author: Niles Salter (Validark)


mikaelholmen (Collaborator) commented Jan 26, 2024

The commit 72f10f7 was reverted in b9483d3, so I guess this should be reopened.

It would of course have been nice if GitHub had picked up that the fix was reverted, but...

The reverting commit mentions a miscompile. I've seen a miscompile with the fix 72f10f7 in my downstream testing as well; I hope it's the same problem.

EDIT:
A reproducer for "my" miscompile can be seen here:
72f10f7#commitcomment-137843601

RKSimon reopened this Jan 26, 2024

RKSimon (Collaborator) commented Jan 26, 2024

Thanks @mikaelholmen. I'm on PTO, so I asked @dyung to revert until I could take a proper look.
