Skip to content

Improve IR for code which finds position of highest bit #43471

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
davidbolvansky opened this issue Nov 23, 2019 · 6 comments
Closed

Improve IR for code which finds position of highest bit #43471

davidbolvansky opened this issue Nov 23, 2019 · 6 comments
Labels
backend:X86 bugzilla Issues migrated from bugzilla

Comments

@davidbolvansky
Copy link
Collaborator

davidbolvansky commented Nov 23, 2019

Bugzilla Link 44126
Version trunk
OS Linux
CC @topperc,@hfinkel,@LebedevRI,@rotateright

Extended Description

https://github.com/facebook/zstd/blob/47034cd6c31125fdba3155abe9a618f580b4f3eb/programs/fileio.c#L1789

unsigned long long FIO_highbit64(unsigned long long v)
{
    unsigned  count = 0;
    v >>= 1;
    while (v) { v >>= 1; count++; }
    return count;
}

should be same as:

unsigned long long FIO_highbit64a(unsigned long long v)
{

    return 63 - __builtin_clzll(v);
}

But first version has worse IR and codegen:

define dso_local i64 @_Z13FIO_highbit64y(i64 %0) local_unnamed_addr #0 {
  %2 = lshr i64 %0, 1
  %3 = call i64 @llvm.ctlz.i64(i64 %2, i1 false), !range !2
  %4 = sub nuw nsw i64 64, %3
  ret i64 %4
}

=>

define dso_local i64 @_Z14FIO_highbit64ay(i64 %0) local_unnamed_addr #1 {
  %2 = tail call i64 @llvm.ctlz.i64(i64 %0, i1 true), !range !3
  %3 = xor i64 %2, 63
  ret i64 %3
}

It would be good to not forget on trunc variant:

unsigned FIO_highbit64(unsigned long long v)
{
    unsigned count = 0;
    v >>= 1;
    while (v) { v >>= 1; count++; }
    return count;
}
define dso_local i32 @_Z13FIO_highbit64y(i64 %0) local_unnamed_addr #0 {
  %2 = lshr i64 %0, 1
  %3 = call i64 @llvm.ctlz.i64(i64 %2, i1 false), !range !2
  %4 = trunc i64 %3 to i32
  %5 = sub nsw i32 64, %4
  ret i32 %5
}

=>

define dso_local i32 @_Z13FIO_highbit64y(i64 %0) local_unnamed_addr #0 {
  %2 = tail call i64 @llvm.cttz.i64(i64 %0, i1 true), !range !2
  %3 = trunc i64 %2 to i32
  %4 = xor i32 %3, 63
  ret i32 %4
}
FIO_highbit64(unsigned long long):
        shr     rdi
        je      .LBB0_1
        bsr     rcx, rdi
        xor     rcx, 63
        mov     eax, 64
        sub     eax, ecx
        ret
.LBB0_1:
        mov     ecx, 64
        mov     eax, 64
        sub     eax, ecx
        ret

vs:

FIO_highbit64(unsigned long long):
        bsf     rax, rdi
        xor     eax, 63
        ret
@davidbolvansky
Copy link
Collaborator Author

davidbolvansky commented Nov 23, 2019

  • uh, ctlz, not cttz.
define dso_local i32 @_Z13FIO_highbit64y(i64 %0) local_unnamed_addr #0 {
  %2 = tail call i64 @llvm.ctlz.i64(i64 %0, i1 true), !range !2
  %3 = trunc i64 %2 to i32
  %4 = xor i32 %3, 63
  ret i32 %4
}

@topperc
Copy link
Collaborator

topperc commented Nov 23, 2019

Isn't the transformed version more poisonous than the original version

define dso_local i64 @_Z13FIO_highbit64y(i64 %0) local_unnamed_addr #0 {
  %2 = lshr i64 %0, 1
  %3 = call i64 @llvm.ctlz.i64(i64 %2, i1 false), !range !2
  %4 = sub nuw nsw i64 64, %3
  ret i64 %4
}

=>

define dso_local i64 @_Z14FIO_highbit64ay(i64 %0) local_unnamed_addr #1 {
  %2 = tail call i64 @llvm.ctlz.i64(i64 %0, i1 true), !range !3
  %3 = xor i64 %2, 63
  ret i64 %3
}

If %0 is 0 the original code has a defined answer. The transformed code produces poison.

@davidbolvansky
Copy link
Collaborator Author

davidbolvansky commented Nov 23, 2019

Right.

But still, even with zero check, this code:

FIO_highbit64(unsigned long long):
        test    rdi, rdi
        je      .LBB0_1
        bsr     rax, rdi
        ret
.LBB0_1:
        xor     eax, eax
        ret

is better then current code.

But with -march=haswell, situation is more interesting:

FIO_highbit64_clz(unsigned long long): 
        lzcnt   rcx, rdi
        xor     ecx, 63
        xor     eax, eax
        test    rdi, rdi
        cmovne  eax, ecx
        ret
FIO_highbit64_loop(unsigned long long):  // better
        lzcnt   rcx, rdi
        mov     eax, 64
        sub     eax, ecx
        ret

https://godbolt.org/z/56wdS-

@topperc
Copy link
Collaborator

topperc commented Nov 23, 2019

This might be yet another case where we should consider doing this.

 bool X86TargetLowering::isCheapToSpeculateCttz() const {
   // Speculate cttz only if we can directly use TZCNT.
-  return Subtarget.hasBMI();
+  return Subtarget.hasBMI() || Subtarget.hasCMov();
 }
 
 bool X86TargetLowering::isCheapToSpeculateCtlz() const {
   // Speculate ctlz only if we can directly use LZCNT.
-  return Subtarget.hasLZCNT();
+  return Subtarget.hasLZCNT() || Subtarget.hasCMov();
 }

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@llvmbot
Copy link
Member

llvmbot commented Aug 22, 2024

@llvm/issue-subscribers-backend-x86

Author: Dávid Bolvanský (davidbolvansky)

| | | | --- | --- | | Bugzilla Link | [44126](https://llvm.org/bz44126) | | Version | trunk | | OS | Linux | | CC | @topperc,@hfinkel,@LebedevRI,@rotateright |

Extended Description

https://github.com/facebook/zstd/blob/47034cd6c31125fdba3155abe9a618f580b4f3eb/programs/fileio.c#L1789

unsigned long long FIO_highbit64(unsigned long long v)
{
    unsigned  count = 0;
    v >>= 1;
    while (v) { v >>= 1; count++; }
    return count;
}

should be same as:

unsigned long long FIO_highbit64a(unsigned long long v)
{

    return 63 - __builtin_clzll(v);
}

But first version has worse IR and codegen:

define dso_local i64 @<!-- -->_Z13FIO_highbit64y(i64 %0) local_unnamed_addr #<!-- -->0 {
  %2 = lshr i64 %0, 1
  %3 = call i64 @<!-- -->llvm.ctlz.i64(i64 %2, i1 false), !range !2
  %4 = sub nuw nsw i64 64, %3
  ret i64 %4
}

=&gt;

define dso_local i64 @<!-- -->_Z14FIO_highbit64ay(i64 %0) local_unnamed_addr #<!-- -->1 {
  %2 = tail call i64 @<!-- -->llvm.ctlz.i64(i64 %0, i1 true), !range !3
  %3 = xor i64 %2, 63
  ret i64 %3
}

It would be good to not forget on trunc variant:

unsigned FIO_highbit64(unsigned long long v)
{
    unsigned count = 0;
    v &gt;&gt;= 1;
    while (v) { v &gt;&gt;= 1; count++; }
    return count;
}
define dso_local i32 @<!-- -->_Z13FIO_highbit64y(i64 %0) local_unnamed_addr #<!-- -->0 {
  %2 = lshr i64 %0, 1
  %3 = call i64 @<!-- -->llvm.ctlz.i64(i64 %2, i1 false), !range !2
  %4 = trunc i64 %3 to i32
  %5 = sub nsw i32 64, %4
  ret i32 %5
}

=&gt;

define dso_local i32 @<!-- -->_Z13FIO_highbit64y(i64 %0) local_unnamed_addr #<!-- -->0 {
  %2 = tail call i64 @<!-- -->llvm.cttz.i64(i64 %0, i1 true), !range !2
  %3 = trunc i64 %2 to i32
  %4 = xor i32 %3, 63
  ret i32 %4
}
FIO_highbit64(unsigned long long):
        shr     rdi
        je      .LBB0_1
        bsr     rcx, rdi
        xor     rcx, 63
        mov     eax, 64
        sub     eax, ecx
        ret
.LBB0_1:
        mov     ecx, 64
        mov     eax, 64
        sub     eax, ecx
        ret

vs:

FIO_highbit64(unsigned long long):
        bsf     rax, rdi
        xor     eax, 63
        ret

@RKSimon
Copy link
Collaborator

RKSimon commented Sep 24, 2024

This was fixed by #102885 / f673882

@RKSimon RKSimon closed this as completed Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:X86 bugzilla Issues migrated from bugzilla
Projects
None yet
Development

No branches or pull requests

4 participants