Missed optimization for -(foo < 0) in Clang when foo was previously sign-extended and in LLVM for all cases. #57381

Closed
dead-claudia opened this issue Aug 26, 2022 · 5 comments

Comments

dead-claudia commented Aug 26, 2022

C source: https://godbolt.org/z/aMTaGdTof (x86-64), https://godbolt.org/z/PM4hbPd3d (32-bit ARM), https://godbolt.org/z/dbaaYaG8n (AArch64)
LLVM output from above (with metadata stripped, generated from the C source): https://godbolt.org/z/6zG5dW4o1 (x86-64)

It's always safe to convert negate(if (foo < 0) then 1 else 0) to signed_shift_right(foo, (bit_width of foo) - 1) for any two's complement signed integer foo, regardless of width. Clang already leverages this when no sign extension is involved: https://godbolt.org/z/ad3sxchKd (x86-64), https://godbolt.org/z/qTja8364h (32-bit ARM), https://godbolt.org/z/f74c8d1Y9 (AArch64). LLVM doesn't perform the transform at all per https://godbolt.org/z/97zh4TEav (hand-written IR, targeting x86-64), and Clang consistently misses it when the operand was previously sign-extended.

(It's odd this is implemented in Clang and not LLVM, as it's a trivial peephole optimization that could feasibly be done even in an SSA model.)
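To make the identity concrete, here's a small standalone check (a sketch for illustration only, not taken from the linked examples) that exhaustively verifies -(x < 0) == x >> 31 for every sign-extended 16-bit value. It assumes the usual two's complement arithmetic right shift that Clang and GCC implement for negative signed operands (implementation-defined in ISO C):

#include <assert.h>
#include <limits.h>

/* Check -(x < 0) == x >> (bit_width - 1) for every short value,
   mirroring the sign-extended case in the functions below. */
int main(void) {
    for (int i = SHRT_MIN; i <= SHRT_MAX; i++) {
        int baz = (short)i;          /* sign-extend the 16-bit value to int */
        int mask_cmp = -(baz < 0);   /* 0 or -1 via the comparison */
        int mask_shift = baz >> 31;  /* 0 or -1 via the sign bit (arithmetic shift) */
        assert(mask_cmp == mask_shift);
    }
    return 0;
}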

Here's the C source from the above link for quick reference (Edit: removed the redundant headers from this copy):

#define SOME_CONSTANT 1000

struct foo_t {
    short baz;
};

int sub_load_mask_1(struct foo_t *foo) {
    int baz = foo->baz;
    return (SOME_CONSTANT - baz) & -(baz < 0);
}

int sub_load_mask_2(struct foo_t *foo) {
    int baz = foo->baz;
    return (SOME_CONSTANT - baz) & (baz >> 31);
}

int sub_imm_mask_1(short baz) {
    return (SOME_CONSTANT - baz) & -(baz < 0);
}

int sub_imm_mask_2(short baz) {
    return (SOME_CONSTANT - baz) & (baz >> 31);
}

int sub_reg_mask_1(int k, short baz) {
    return (k - baz) & -(baz < 0);
}

int sub_reg_mask_2(int k, short baz) {
    return (k - baz) & ((int)baz >> 31);
}

int sub_simple_mask_1(int k, short baz) {
    return k & -(baz < 0);
}

int sub_simple_mask_2(int k, short baz) {
    return k & ((int)baz >> 31);
}

int simple_mask_1(int k, short baz) {
    return -(baz < 0);
}

int simple_mask_2(int k, short baz) {
    return ((int)baz >> 31);
}

All the transforms seem to be valid according to Alive2.

Also worth noting that ARM (especially its 32-bit variant) would benefit from this noticeably more than x86-64, as the links detail, thanks to its barrel shifter.

  • ARM 32-bit:
    • sub_load_mask_*: 7 instructions to 3 instructions (ldrsh followed by the sub_imm_mask_* equivalent)
    • sub_imm_mask_*: 7 instructions to just rsb rT, rA, #imm + and rD, rT, rB asr #31
    • sub_reg_mask_*: 7 instructions to just sub rT, rA, rB asr #15 + and rD, rT, rB asr #31
    • sub_simple_mask_*: 6 instructions to just and rD, rA, rB asr #15
    • simple_mask_*: 5 instructions to just asr rD, rA, #15
  • AArch64:
    • sub_load_mask_*: 7 instructions to 4 instructions
    • sub_imm_mask_*: 5 instructions to 4 instructions
    • sub_reg_mask_*: 4 instructions to 3 instructions
    • sub_simple_mask_*: 3 instructions to just and rT, rA, #0x8000 + neg rT, rA asr #15
    • simple_mask_*: 2 instructions to just sbfx rD, rA, #15, #1

dead-claudia (Author) commented

Also, of course, this would result in smaller WebAssembly output: https://godbolt.org/z/sfvnoxG8q (C), https://godbolt.org/z/qPr5Pf5Pf (LLVM)

rotateright (Contributor) commented

The shift+*ext patterns were chosen as canonical in IR near the dawn of LLVM. Someday, we should see if changing that to "*ext (x < 0)" (icmp+ext) makes more sense after all these years. Barring that, it looks like a small patch to InstCombine::Negator() would catch the simple case (and that should unlock the larger optimizations).

rotateright self-assigned this Aug 26, 2022
rotateright added a commit that referenced this issue Aug 27, 2022
0 - (zext (i8 X u>> 7) to iN) --> sext (i8 X s>> 7) to iN

https://alive2.llvm.org/ce/z/jzv4Ud

This is part of solving issue #57381.
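
For reference, the fold in that commit message can be spot-checked in C over all 256 i8 values. This is a standalone sketch, not part of the commit; the C integer promotions do the extension before the shift rather than after, but the result is the same for a 7-bit shift of an 8-bit value, and it assumes the usual two's complement conversion and arithmetic shift behaviour:

#include <assert.h>

/* Spot-check: 0 - (zext (i8 X u>> 7) to i32) == sext (i8 X s>> 7) to i32
   for every 8-bit value. */
int main(void) {
    for (int i = 0; i < 256; i++) {
        unsigned char x = (unsigned char)i;
        int negated_zext = 0 - (int)(x >> 7);    /* left-hand side of the fold */
        int sext_ashr = ((signed char)x) >> 7;   /* right-hand side of the fold */
        assert(negated_zext == sext_ashr);
    }
    return 0;
}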

rotateright (Contributor) commented

I was too optimistic in thinking one patch would trigger the other transforms that we want. I haven't followed it through to codegen yet, but this is hopefully an improvement:
c6e5602

rotateright (Contributor) commented

I think we have the IR in shape after that last commit. Can you confirm that the assembly is what you expect for all targets/examples?

rotateright (Contributor) commented

Another pair of related patches:
246078604c87
6c39a3aae1dc

There's at least one more lshr-of-signbit transform we could do, but it requires changing a conflicting fold to avoid an infinite compiler loop.

AFAICT, we now have the ideal IR and codegen on the examples shown here, so I'm going to close the report.

Feel free to re-open or file another bug if there are still problems. Thanks for the detailed analysis and excellent code examples!
