Skip to content

Reassociate pass serializes vector comparison results #64840

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
RKSimon opened this issue Aug 20, 2023 · 16 comments · Fixed by #123329
Closed

Reassociate pass serializes vector comparison results #64840

RKSimon opened this issue Aug 20, 2023 · 16 comments · Fixed by #123329
Assignees
Labels
good first issue https://github.com/llvm/llvm-project/contribute llvm:transforms missed-optimization

Comments

@RKSimon
Copy link
Collaborator

RKSimon commented Aug 20, 2023

https://godbolt.org/z/Kh7667Yxo

Pulled out of Issue #63946

While the scalar code for an accumulative or chain of comparison results stays as a binary tree:

define i1 @scalar(i32 %a, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7) local_unnamed_addr {
entry:
  %cmp0 = icmp eq i32 %b0, %a
  %cmp1 = icmp eq i32 %b1, %a
  %cmp2 = icmp eq i32 %b2, %a
  %cmp3 = icmp eq i32 %b3, %a
  %cmp4 = icmp eq i32 %b4, %a
  %cmp5 = icmp eq i32 %b5, %a
  %cmp6 = icmp eq i32 %b6, %a
  %cmp7 = icmp eq i32 %b7, %a
  %or01 = or i1 %cmp0, %cmp1
  %or23 = or i1 %cmp2, %cmp3
  %or45 = or i1 %cmp4, %cmp5
  %or67 = or i1 %cmp6, %cmp7
  %or0123 = or i1 %or01, %or23
  %or4567 = or i1 %or45, %or67
  %or01234567 = or i1 %or0123, %or4567
  ret i1 %or01234567
}

The vector code equivalent gets linearised, resulting in a much deeper serial chain, affecting IPC and making it more likely we will hit value tracking recursion depth limits:

define <8 x i1> @vector(<8 x i32> %a, <8 x i32> %b0, <8 x i32> %b1, <8 x i32> %b2, <8 x i32> %b3, <8 x i32> %b4, <8 x i32> %b5, <8 x i32> %b6, <8 x i32> %b7) local_unnamed_addr {
entry:
  %cmp0 = icmp eq <8 x i32> %b0, %a
  %cmp1 = icmp eq <8 x i32> %b1, %a
  %cmp2 = icmp eq <8 x i32> %b2, %a
  %cmp3 = icmp eq <8 x i32> %b3, %a
  %cmp4 = icmp eq <8 x i32> %b4, %a
  %cmp5 = icmp eq <8 x i32> %b5, %a
  %cmp6 = icmp eq <8 x i32> %b6, %a
  %cmp7 = icmp eq <8 x i32> %b7, %a
  %or01 = or <8 x i1> %cmp0, %cmp1
  %or23 = or <8 x i1> %cmp2, %cmp3
  %or45 = or <8 x i1> %cmp4, %cmp5
  %or67 = or <8 x i1> %cmp6, %cmp7
  %or0123 = or <8 x i1> %or01, %or23
  %or4567 = or <8 x i1> %or45, %or67
  %or01234567 = or <8 x i1> %or0123, %or4567
  ret <8 x i1> %or01234567
}

->

define <8 x i1> @vector(<8 x i32> %a, <8 x i32> %b0, <8 x i32> %b1, <8 x i32> %b2, <8 x i32> %b3, <8 x i32> %b4, <8 x i32> %b5, <8 x i32> %b6, <8 x i32> %b7) local_unnamed_addr {
  %cmp0 = icmp eq <8 x i32> %b0, %a
  %cmp1 = icmp eq <8 x i32> %b1, %a
  %cmp2 = icmp eq <8 x i32> %b2, %a
  %cmp3 = icmp eq <8 x i32> %b3, %a
  %cmp4 = icmp eq <8 x i32> %b4, %a
  %cmp5 = icmp eq <8 x i32> %b5, %a
  %cmp6 = icmp eq <8 x i32> %b6, %a
  %cmp7 = icmp eq <8 x i32> %b7, %a
  %or67 = or <8 x i1> %cmp1, %cmp0
  %or45 = or <8 x i1> %or67, %cmp2
  %or4567 = or <8 x i1> %or45, %cmp3
  %or23 = or <8 x i1> %or4567, %cmp4
  %or01 = or <8 x i1> %or23, %cmp5
  %or0123 = or <8 x i1> %or01, %cmp6
  %or01234567 = or <8 x i1> %or0123, %cmp7
  ret <8 x i1> %or01234567
}
@RKSimon
Copy link
Collaborator Author

RKSimon commented Mar 6, 2024

The magic code is:

if (I->getType()->isIntegerTy(1))

Could this be as simple as changing to isIntOrIntVectorTy(1)?

@dtcxzyw dtcxzyw added the good first issue https://github.com/llvm/llvm-project/contribute label Mar 6, 2024
@llvmbot
Copy link
Member

llvmbot commented Mar 6, 2024

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. In the comments of the issue, request for it to be assigned to you.
  2. Fix the issue locally.
  3. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  4. Create a Git commit.
  5. Run git clang-format HEAD~1 to format your changes.
  6. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

@llvmbot
Copy link
Member

llvmbot commented Mar 6, 2024

@llvm/issue-subscribers-good-first-issue

Author: Simon Pilgrim (RKSimon)

https://godbolt.org/z/Kh7667Yxo

Pulled out of Issue #63946

While the scalar code for an accumulative or chain of comparison results stays as a binary tree:

define i1 @<!-- -->scalar(i32 %a, i32 %b0, i32 %b1, i32 %b2, i32 %b3, i32 %b4, i32 %b5, i32 %b6, i32 %b7) local_unnamed_addr {
entry:
  %cmp0 = icmp eq i32 %b0, %a
  %cmp1 = icmp eq i32 %b1, %a
  %cmp2 = icmp eq i32 %b2, %a
  %cmp3 = icmp eq i32 %b3, %a
  %cmp4 = icmp eq i32 %b4, %a
  %cmp5 = icmp eq i32 %b5, %a
  %cmp6 = icmp eq i32 %b6, %a
  %cmp7 = icmp eq i32 %b7, %a
  %or01 = or i1 %cmp0, %cmp1
  %or23 = or i1 %cmp2, %cmp3
  %or45 = or i1 %cmp4, %cmp5
  %or67 = or i1 %cmp6, %cmp7
  %or0123 = or i1 %or01, %or23
  %or4567 = or i1 %or45, %or67
  %or01234567 = or i1 %or0123, %or4567
  ret i1 %or01234567
}

The vector code equivalent gets linearised, resulting in a much deeper serial chain, affecting IPC and making it more likely we will hit value tracking recursion depth limits:

define &lt;8 x i1&gt; @<!-- -->vector(&lt;8 x i32&gt; %a, &lt;8 x i32&gt; %b0, &lt;8 x i32&gt; %b1, &lt;8 x i32&gt; %b2, &lt;8 x i32&gt; %b3, &lt;8 x i32&gt; %b4, &lt;8 x i32&gt; %b5, &lt;8 x i32&gt; %b6, &lt;8 x i32&gt; %b7) local_unnamed_addr {
entry:
  %cmp0 = icmp eq &lt;8 x i32&gt; %b0, %a
  %cmp1 = icmp eq &lt;8 x i32&gt; %b1, %a
  %cmp2 = icmp eq &lt;8 x i32&gt; %b2, %a
  %cmp3 = icmp eq &lt;8 x i32&gt; %b3, %a
  %cmp4 = icmp eq &lt;8 x i32&gt; %b4, %a
  %cmp5 = icmp eq &lt;8 x i32&gt; %b5, %a
  %cmp6 = icmp eq &lt;8 x i32&gt; %b6, %a
  %cmp7 = icmp eq &lt;8 x i32&gt; %b7, %a
  %or01 = or &lt;8 x i1&gt; %cmp0, %cmp1
  %or23 = or &lt;8 x i1&gt; %cmp2, %cmp3
  %or45 = or &lt;8 x i1&gt; %cmp4, %cmp5
  %or67 = or &lt;8 x i1&gt; %cmp6, %cmp7
  %or0123 = or &lt;8 x i1&gt; %or01, %or23
  %or4567 = or &lt;8 x i1&gt; %or45, %or67
  %or01234567 = or &lt;8 x i1&gt; %or0123, %or4567
  ret &lt;8 x i1&gt; %or01234567
}

->

define &lt;8 x i1&gt; @<!-- -->vector(&lt;8 x i32&gt; %a, &lt;8 x i32&gt; %b0, &lt;8 x i32&gt; %b1, &lt;8 x i32&gt; %b2, &lt;8 x i32&gt; %b3, &lt;8 x i32&gt; %b4, &lt;8 x i32&gt; %b5, &lt;8 x i32&gt; %b6, &lt;8 x i32&gt; %b7) local_unnamed_addr {
  %cmp0 = icmp eq &lt;8 x i32&gt; %b0, %a
  %cmp1 = icmp eq &lt;8 x i32&gt; %b1, %a
  %cmp2 = icmp eq &lt;8 x i32&gt; %b2, %a
  %cmp3 = icmp eq &lt;8 x i32&gt; %b3, %a
  %cmp4 = icmp eq &lt;8 x i32&gt; %b4, %a
  %cmp5 = icmp eq &lt;8 x i32&gt; %b5, %a
  %cmp6 = icmp eq &lt;8 x i32&gt; %b6, %a
  %cmp7 = icmp eq &lt;8 x i32&gt; %b7, %a
  %or67 = or &lt;8 x i1&gt; %cmp1, %cmp0
  %or45 = or &lt;8 x i1&gt; %or67, %cmp2
  %or4567 = or &lt;8 x i1&gt; %or45, %cmp3
  %or23 = or &lt;8 x i1&gt; %or4567, %cmp4
  %or01 = or &lt;8 x i1&gt; %or23, %cmp5
  %or0123 = or &lt;8 x i1&gt; %or01, %cmp6
  %or01234567 = or &lt;8 x i1&gt; %or0123, %cmp7
  ret &lt;8 x i1&gt; %or01234567
}

@RKSimon
Copy link
Collaborator Author

RKSimon commented Mar 6, 2024

@dtcxzyw A small warning - there looks to be no test coverage for this - so this ticket might end up being rather lengthy - adding useful scalar AND vector test coverage, performance testing, .....

@SahilPatidar
Copy link
Contributor

@RKSimon Is this task still available? I'd be happy to work on it if it is.

@RKSimon
Copy link
Collaborator Author

RKSimon commented Mar 13, 2024

  • Add test coverage for reassociation of scalar/vector boolean types
  • Disable vector boolean reassociation to match scalar boolean types
  • Run test suite before/after tests for the effect of disabling scalar and vector boolean reassociation (although scalar most likely is the only one to show a diff)

@SahilPatidar
Copy link
Contributor

@RKSimon Sorry for the delay. I noticed a difference in vector behavior with -passes=reassociate. When I perform the same operation as in the example you provided

@SahilPatidar
Copy link
Contributor

@RKSimon

@RKSimon
Copy link
Collaborator Author

RKSimon commented Apr 13, 2024

@SahilPatidar Were you able to create some suitable tests ?

@SahilPatidar
Copy link
Contributor

I tried some tests similar to the ones you mentioned, but with some modifications.

@SahilPatidar
Copy link
Contributor

@RKSimon, What is the next step?

@RKSimon
Copy link
Collaborator Author

RKSimon commented May 8, 2024

2 things in parallel

  • Create a PR for the isIntOrIntVectorTy(1) change
  • Setup and run the test-suite and see if there is any perf differences between (1) current trunk, (2) removing the isIntegerTy(1) early-out entirely and (3) updating the early-out to isIntOrIntVectorTy(1)

@SahilPatidar
Copy link
Contributor

@RKSimon, I apologize for the delayed reply; I've been swamped with my GSoC project. Currently, I'm trying to build the test-suite but I'm facing a problem:

FAILED: MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench 
: && /Users/sahilpatidar/Desktop/llvm/llvm-project/test-suite/test-suite-build/tools/timeit --summary MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.link.time /Users/sahilpatidar/Desktop/llvm/llvm-project/build/clang-build/bin/clang -O3 -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names  MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/CMakeFiles/rsbench.dir/glibc_compat_rand.c.o MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/CMakeFiles/rsbench.dir/init.c.o MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/CMakeFiles/rsbench.dir/io.c.o MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/CMakeFiles/rsbench.dir/main.c.o MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/CMakeFiles/rsbench.dir/material.c.o MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/CMakeFiles/rsbench.dir/utils.c.o MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/CMakeFiles/rsbench.dir/xs_kernel.c.o -o MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench  -lm && cd /Users/sahilpatidar/Desktop/llvm/llvm-project/test-suite/test-suite-build/MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench && /opt/homebrew/Cellar/cmake/3.25.1/bin/cmake -E create_symlink /Users/sahilpatidar/Desktop/llvm/llvm-project/test-suite/llvm-test-suite-main/MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.reference_output /Users/sahilpatidar/Desktop/llvm/llvm-project/test-suite/test-suite-build/MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.reference_output
Undefined symbols for architecture arm64:
  "___divdc3", referenced from:
      _fast_nuclear_W in xs_kernel.c.o
      _calculate_micro_xs in xs_kernel.c.o

cmake config:

cmake -G Ninja -DCMAKE_C_COMPILER=/Users/sahilpatidar/Desktop/llvm/llvm-project/build/clang-build/bin/clang \
        -C../llvm-test-suite-main/cmake/caches/O3.cmake -DTEST_SUITE_COLLECT_CODE_SIZE=OFF \
       -DCMAKE_BUILD_TYPE=Release ../llvm-test-suite-main

@RKSimon
Copy link
Collaborator Author

RKSimon commented Jan 8, 2025

@SahilPatidar Are you still looking at this at all please?

RKSimon added a commit to RKSimon/llvm-project that referenced this issue Jan 15, 2025
@RKSimon
Copy link
Collaborator Author

RKSimon commented Jan 16, 2025

I've run the test-suite on 3 variants:

  • nothing (remove existing isIntegerTy(1) exception)
  • baseline (trunk)
  • boolvector (extend isIntegerTy(1) expection to isIntOrIntVectorTy(1))
../llvm-test-suite/utils/compare.py --filter-short nothing.json baseline.json boolvector.json
Tests: 3385
Short Running: 2432 (filtered out)
Remaining: 953
Metric: exec_time

Program                                       exec_time
                                              nothing   baseline boolvector diff
MicroBench...ForIC2VW1LoopWithReductionTC64     23.97     14.59    14.76    64.3%
MicroBench...s.test:benchForIC1VW4LoopTC128     24.84     24.73    15.75    57.7%
MicroBench...orIC1VW4LoopWithReductionTC128     12.33     11.82    18.51    56.6%
MicroBench...ForIC2VW1LoopWithReductionTC63     15.20     19.36    22.22    46.2%
MicroBench...w.test:BM_INT_PREDICT_RAW/5001     23.33     18.58    16.07    45.2%
MicroBench...Raw.test:BM_MAT_X_MAT_RAW/5001   9646.40   6898.02  6692.31    44.1%
SingleSour...g/correlation/correlation.test      9.93      7.45    10.60    42.2%
MicroBench...orIC2VW1LoopWithReductionTC128     29.31     28.94    38.47    32.9%
MultiSourc...nchmarks/llubenchmark/llu.test      5.91      5.96     7.68    29.9%
MicroBench...runtime_checks_needed<16, int>      2.26      2.80     2.28    24.2%
MicroBench...Raw.test:BM_HYDRO_1D_RAW/44217     18.45     22.76    22.02    23.4%
MicroBench...Raw.test:BM_MULADDSUB_RAW/5001      5.95      7.15     5.83    22.8%
MicroBench...runtime_checks_needed<16, int>      2.28      2.78     2.30    21.8%
MicroBench...est:BM_INT_PREDICT_LAMBDA/5001     15.16     18.41    17.00    21.4%
MicroBench....test:BM_HYDRO_1D_LAMBDA/44217     18.16     21.50    19.61    18.4%
                           Geomean difference                                3.2%
           exec_time
run          nothing       baseline     boolvector        diff
count  953.000000     951.000000     950.000000     953.000000
mean   4459.237572    4372.606850    4369.977982    0.033258
std    47796.954839   46923.603821   46826.428537   0.051621
min    1.001400       1.001008       1.003193       0.000000
25%    4.785821       4.751082       4.821514       0.012759
50%    29.434356      28.586248      29.192429      0.021949
75%    391.581207     387.085074     391.963390     0.036533
max    773885.682074  777549.342105  769975.521405  0.642809

@RKSimon
Copy link
Collaborator Author

RKSimon commented Jan 16, 2025

The numbers look favorable, although both the existing i1 handling and enabling vXi1 handling as well can cause some smaller regressions - I think overall the change is worth it, but without knowing a lot more about the reassociation pass I don't know of anything else we can try to improve things further.

@dtcxzyw @nikic any thoughts?

@RKSimon RKSimon self-assigned this Jan 20, 2025
RKSimon added a commit that referenced this issue Jan 21, 2025
Extends what we already do for i1 types and don't serialize vXi1 logical expressions to improve ILP.

llvm-test-suite numbers
#64840 (comment)
indicate that both reassociations are a net win.

Fixes #64840
Fixes #63946
github-actions bot pushed a commit to arm/arm-toolchain that referenced this issue Jan 21, 2025
…#123329)

Extends what we already do for i1 types and don't serialize vXi1 logical expressions to improve ILP.

llvm-test-suite numbers
llvm/llvm-project#64840 (comment)
indicate that both reassociations are a net win.

Fixes #64840
Fixes #63946
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue https://github.com/llvm/llvm-project/contribute llvm:transforms missed-optimization
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants