Reassociate pass serializes vector comparison results #64840
The magic code is:
Could this be as simple as changing to |
Hi! This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:
If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below. |
@llvm/issue-subscribers-good-first-issue Author: Simon Pilgrim (RKSimon)
https://godbolt.org/z/Kh7667Yxo
Pulled out of Issue #63946
While the scalar code for an accumulated OR chain of comparison results stays as a binary tree:
define i1 @scalar(i32 %a, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7) local_unnamed_addr {
entry:
%cmp0 = icmp eq i32 %b0, %a
%cmp1 = icmp eq i32 %b1, %a
%cmp2 = icmp eq i32 %b2, %a
%cmp3 = icmp eq i32 %b3, %a
%cmp4 = icmp eq i32 %b4, %a
%cmp5 = icmp eq i32 %b5, %a
%cmp6 = icmp eq i32 %b6, %a
%cmp7 = icmp eq i32 %b7, %a
%or01 = or i1 %cmp0, %cmp1
%or23 = or i1 %cmp2, %cmp3
%or45 = or i1 %cmp4, %cmp5
%or67 = or i1 %cmp6, %cmp7
%or0123 = or i1 %or01, %or23
%or4567 = or i1 %or45, %or67
%or01234567 = or i1 %or0123, %or4567
ret i1 %or01234567
}
The vector code equivalent gets linearised by the reassociate pass, resulting in a much deeper serial chain (seven dependent ors instead of three levels), which hurts IPC and makes it more likely that we will hit value tracking recursion depth limits:
define <8 x i1> @vector(<8 x i32> %a, <8 x i32> %b0, <8 x i32> %b1, <8 x i32> %b2, <8 x i32> %b3, <8 x i32> %b4, <8 x i32> %b5, <8 x i32> %b6, <8 x i32> %b7) local_unnamed_addr {
entry:
%cmp0 = icmp eq <8 x i32> %b0, %a
%cmp1 = icmp eq <8 x i32> %b1, %a
%cmp2 = icmp eq <8 x i32> %b2, %a
%cmp3 = icmp eq <8 x i32> %b3, %a
%cmp4 = icmp eq <8 x i32> %b4, %a
%cmp5 = icmp eq <8 x i32> %b5, %a
%cmp6 = icmp eq <8 x i32> %b6, %a
%cmp7 = icmp eq <8 x i32> %b7, %a
%or01 = or <8 x i1> %cmp0, %cmp1
%or23 = or <8 x i1> %cmp2, %cmp3
%or45 = or <8 x i1> %cmp4, %cmp5
%or67 = or <8 x i1> %cmp6, %cmp7
%or0123 = or <8 x i1> %or01, %or23
%or4567 = or <8 x i1> %or45, %or67
%or01234567 = or <8 x i1> %or0123, %or4567
ret <8 x i1> %or01234567
}
->
define <8 x i1> @vector(<8 x i32> %a, <8 x i32> %b0, <8 x i32> %b1, <8 x i32> %b2, <8 x i32> %b3, <8 x i32> %b4, <8 x i32> %b5, <8 x i32> %b6, <8 x i32> %b7) local_unnamed_addr {
%cmp0 = icmp eq <8 x i32> %b0, %a
%cmp1 = icmp eq <8 x i32> %b1, %a
%cmp2 = icmp eq <8 x i32> %b2, %a
%cmp3 = icmp eq <8 x i32> %b3, %a
%cmp4 = icmp eq <8 x i32> %b4, %a
%cmp5 = icmp eq <8 x i32> %b5, %a
%cmp6 = icmp eq <8 x i32> %b6, %a
%cmp7 = icmp eq <8 x i32> %b7, %a
%or67 = or <8 x i1> %cmp1, %cmp0
%or45 = or <8 x i1> %or67, %cmp2
%or4567 = or <8 x i1> %or45, %cmp3
%or23 = or <8 x i1> %or4567, %cmp4
%or01 = or <8 x i1> %or23, %cmp5
%or0123 = or <8 x i1> %or01, %cmp6
%or01234567 = or <8 x i1> %or0123, %cmp7
ret <8 x i1> %or01234567
} |
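To make the depth difference in the two vector functions above concrete, here is a small Python sketch. It is only an illustrative model of the dependency graph, not code from LLVM or the reassociate pass; the helper names (`linear_chain`, `balanced_tree`, `depth`, `count_ors`) are made up for this example. It builds both shapes of the `or` reduction over eight compare results and measures the longest chain of dependent instructions.

```python
# Illustrative model only: this is NOT LLVM's reassociate pass, just a way to
# compare the dependency depth of the two `or` reduction shapes in this issue.

def linear_chain(operands):
    """Fold left to right: (((cmp0 | cmp1) | cmp2) | ...), like the linearised IR."""
    expr = operands[0]
    for op in operands[1:]:
        expr = ("or", expr, op)
    return expr

def balanced_tree(operands):
    """Pair up neighbours level by level, like the original binary-tree IR."""
    level = list(operands)
    while len(level) > 1:
        nxt = [("or", level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry an unpaired operand up to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

def depth(expr):
    """Longest chain of dependent `or` instructions (leaves have depth 0)."""
    if isinstance(expr, str):
        return 0
    _, lhs, rhs = expr
    return 1 + max(depth(lhs), depth(rhs))

def count_ors(expr):
    """Total number of `or` instructions in the expression."""
    if isinstance(expr, str):
        return 0
    _, lhs, rhs = expr
    return 1 + count_ors(lhs) + count_ors(rhs)

cmps = [f"cmp{i}" for i in range(8)]
tree, chain = balanced_tree(cmps), linear_chain(cmps)
print(count_ors(tree), depth(tree))    # 7 3
print(count_ors(chain), depth(chain))  # 7 7
```

Both shapes use the same seven `or` instructions; only the critical path changes, from three dependent operations to seven, which is why the linearised form limits instruction-level parallelism.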
@dtcxzyw A small warning: there appears to be no test coverage for this, so this ticket might end up being rather lengthy (adding useful scalar AND vector test coverage, performance testing, etc.). |
@RKSimon Is this task still available? I'd be happy to work on it if it is. |
|
@RKSimon Sorry for the delay. I noticed a difference in vector behavior with -passes=reassociate. When I perform the same operation as in the example you provided |
@SahilPatidar Were you able to create some suitable tests? |
I tried some tests similar to the ones you mentioned, but with some modifications. |
@RKSimon, What is the next step? |
2 things in parallel
|
@RKSimon, I apologize for the delayed reply; I've been swamped with my GSoC project. Currently, I'm trying to build the test-suite but I'm facing a problem:
cmake config:
|
@SahilPatidar Are you still looking at this at all please? |
I've run the test-suite on 3 variants:
|
The numbers look favorable, although both the existing i1 handling and enabling vXi1 handling as well can cause some smaller regressions - I think overall the change is worth it, but without knowing a lot more about the reassociation pass I don't know of anything else we can try to improve things further. |
Extend what we already do for i1 types and don't serialize vXi1 logical expressions, to improve ILP. The llvm-test-suite numbers in #64840 (comment) indicate that both reassociations are a net win. Fixes #64840 Fixes #63946
…#123329) Extend what we already do for i1 types and don't serialize vXi1 logical expressions, to improve ILP. The llvm-test-suite numbers in llvm/llvm-project#64840 (comment) indicate that both reassociations are a net win. Fixes #64840 Fixes #63946