Early exit optimization of Fortran array expressions #129812
@llvm/issue-subscribers-flang-ir Author: Ivan Pribec (ivan-pi)
Consider a function for checking if an array is sorted:
```fortran
!
! Check that an array of integers is sorted in ascending order
!
logical function is_sorted_scalar(n, a) result(is_sorted)
  integer, intent(in) :: n
  integer, intent(in) :: a(n)
  integer :: i
  !$omp simd simdlen(8) early_exit
  do i = 2, n
    if (a(i) < a(i-1)) then
      is_sorted = .false.
      return
    end if
  end do
  is_sorted = .true.
end function
logical function is_sorted_all(n, a) result(is_sorted)
  integer, intent(in) :: n
  integer, intent(in) :: a(n)
  is_sorted = all(a(2:n) >= a(1:n-1))
end function

program benchmark
  ! (benchmark driver body lost in the capture)
contains
  ! is_sorted_scalar and is_sorted_all as above
end program
```
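To make the two strategies concrete, here is a C translation of the two variants (a sketch; the names `is_sorted_early_exit` and `is_sorted_reduce` are mine, not from the issue): the first returns as soon as an out-of-order pair is found, while the second always scans the whole array, mirroring what the `all(a(2:n) >= a(1:n-1))` reduction does.

```c
#include <stdbool.h>

/* Early-exit variant: stops at the first out-of-order pair. */
static bool is_sorted_early_exit(int n, const int *a) {
    for (int i = 1; i < n; i++)
        if (a[i] < a[i - 1])
            return false;
    return true;
}

/* Reduction variant: mirrors the all(a(2:n) >= a(1:n-1)) expression,
 * scanning every element; the branch-free body vectorizes easily. */
static bool is_sorted_reduce(int n, const int *a) {
    bool ok = true;
    for (int i = 1; i < n; i++)
        ok &= (a[i] >= a[i - 1]);
    return ok;
}
```

The two functions always agree on the result; they differ only in how much of the array they touch when the answer is `false`.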
```sh
~/fortran/is_sorted$ make FC=flang-new FFLAGS="-O2 -march=native" standalone
```
At O1 and above, flang does not create an array temporary for the `a(2:n) >= a(1:n-1)` expression. Early exit is not always better: because of its lack of vectorization, as you mentioned, it is very dependent on the data. Your reproducer inserted the unsorted value at the beginning of the array, so yes, the early exit is much better, but if you place it at the end (e.g. as the last element of the array), the numbers are reversed:
Doing quick maths, the micro-benchmark with the unsorted element in the last positions shows the early-exit version is currently at best 4x (6 / 1.5) more expensive per iteration, but will do only N/2 iterations on average for unsorted containers (assuming a random distribution over all the arrays out there to be checked), while the flang implementation always does N iterations. So "on average", the flang implementation is twice as fast as the early-exit implementation (N vs 4*N/2), and when I modify your benchmark to take the arithmetic average of runs moving the unordered element from 1 to n, that is pretty much what I see (I actually even see a 4x difference, measuring an average of 6s, just like with the scalar version of flang, which tells me gfortran may have done extra "unexpected" optimizations in the original micro-benchmark).

Also, I think the reduction loop without early exit is easier to parallelize, though I am not an expert here, and we could anyway generate different code for the sequential and parallel cases. So I tend to think that, without further optimization of early-exit loops in LLVM, the flang implementation is the right simple one (I am sure there are advanced implementations that would do better; using a library would however force the materialization of a temporary for the argument here, which is probably worse).
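The "N vs 4*N/2" arithmetic above can be checked with a tiny simulation (a sketch; the function name is mine, and the 4x per-iteration cost is taken from the measurements quoted in the comment): placing the single unsorted element at each position in turn, the early-exit scan performs about N/2 comparisons on average, while the reduction always performs N-1.

```c
#include <stdlib.h>

/* Early-exit scan; returns the number of comparisons performed. */
static long scan_cost(int n, const int *a) {
    long c = 0;
    for (int i = 1; i < n; i++) {
        c++;
        if (a[i] < a[i - 1])
            break;                      /* early exit */
    }
    return c;
}

/* Average early-exit cost when the single out-of-order element is
 * placed uniformly at each position 1..n-1, as assumed above. */
double avg_early_exit_cost(int n) {
    int *a = malloc(n * sizeof *a);
    long total = 0;
    for (int p = 1; p < n; p++) {
        for (int i = 0; i < n; i++)
            a[i] = i;                   /* sorted ... */
        a[p] = -1;                      /* ... except at position p */
        total += scan_cost(n, a);
    }
    free(a);
    return (double)total / (n - 1);
}
```

For n = 1000 the average is exactly 500 comparisons, i.e. N/2, against the 999 the reduction always does, which is the break-even arithmetic quoted above.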
Thanks for the more thorough analysis changing the position of the unsorted element, and for the rest of the explanation. I'm happy to learn no temporary is generated; it's possible I got confused by what I was looking at. I read somewhere that both Arm SVE and RISC-V have features which allow vectorization of such loops.
That is very likely; there is some work going on to try to get rid of array temporaries at O1 and above (you may still see some array temporaries for some other intrinsic array arguments where other compilers may inline and get rid of the temporary). Out of curiosity, did you see a slowdown in an application/benchmark with the flang implementation of `all`, or was your finding purely from micro-benchmarks? I am asking because I have not seen any applications yet where flang is noticeably slower because of the ALL implementation choice, but if an application has "biased" unordered arrays, the flang implementation will indeed be slower (just like applications "biased" in the other direction, with "late unordered" elements, would favor flang).
I am not familiar with this, but it would be great if LLVM could leverage that!
There is a bit of work on vectorising early-exit loops with SVE, e.g. #128880. I am not an expert, but as I understand it, it is quite complex to determine whether this sort of thing will actually be profitable, as some current-generation cores are slower for some of the more exotic SVE features.
After more thinking, there is something we could do on the flang side to improve the sequential performance: we can unroll before generating an early exit. I tested manually with a naïve unrolling (without a proper post loop and the size check needed to enter the unrolled loop, so only valid for sizes that are a multiple of the unroll factor).

With the flang pipeline, it got me to 2.6s average vs 1.5s for the current implementation. LLVM does not have the liberty flang has to unroll the loop like I did, because it does not know it is OK to "over-evaluate" after the early exit. Flang could do it as long as it knows this is a reduction and we can evaluate as many iterations as we need/want. So we can do better, but this opens the door to something we have not delved into in flang so far: target-dependent loop optimizations (to find the optimal unrolling factor). So far we happily delegate that to LLVM. All that to say, we could do something on the flang side, but that optimization would not be my priority without an actual application where the all/any reduction speedup would be significant overall.
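The unrolling idea described above can be sketched in C as follows (my own illustration, not flang's actual lowering; the unroll factor 8 is an arbitrary choice, and unlike the quick manual test mentioned, a scalar remainder loop makes it valid for any size): each block of 8 comparisons is accumulated branch-free, and the early-exit test happens once per block, which is safe precisely because over-evaluating a few extra comparisons past the first violation cannot change the reduction result.

```c
#include <stdbool.h>

/* Unrolled early-exit scan: may evaluate up to 7 comparisons past
 * the first violation, which is harmless for this reduction.
 * Unroll factor 8 is illustrative, not tuned for any target. */
bool is_sorted_unrolled(int n, const int *a) {
    int i = 1;
    for (; i + 8 <= n; i += 8) {
        int ok = 1;
        for (int j = 0; j < 8; j++)       /* branch-free block */
            ok &= (a[i + j] >= a[i + j - 1]);
        if (!ok)
            return false;                 /* one exit test per block */
    }
    for (; i < n; i++)                    /* remainder loop */
        if (a[i] < a[i - 1])
            return false;
    return true;
}
```

This keeps the vector-friendly inner body of the reduction while recovering most of the early-exit benefit, at the cost of the target-dependent choice of block size discussed above.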
It appears that the `is_sorted_all` function in flang generates a temporary array for the `a(2:n) >= a(1:n-1)` expression, and then performs the `all` reduction. This is fast due to vectorization, but it misses the chance of an early exit. The effect is noticeable in the runtime:
It would be nice if early-exit vectorization were also supported (https://discourse.llvm.org/t/rfc-supporting-more-early-exit-loops/84690). With x86 SIMD extensions it seems this still has to be done manually: http://0x80.pl/notesen/2018-04-11-simd-is-sorted.html
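For reference, the manual-SIMD approach for x86 can be sketched with SSE2 intrinsics (my own minimal version, not taken from the linked post, and it assumes an x86 target): four adjacent pairs are compared per iteration, with one early-exit test per vector.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdbool.h>

/* Compare four adjacent pairs per iteration via unaligned loads of
 * a[i..i+3] and a[i-1..i+2]; exit as soon as any lane sees
 * a[i] < a[i-1]. Requires x86 with SSE2. */
bool is_sorted_sse2(int n, const int *a) {
    int i = 1;
    for (; i + 4 <= n; i += 4) {
        __m128i cur  = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i prev = _mm_loadu_si128((const __m128i *)(a + i - 1));
        if (_mm_movemask_epi8(_mm_cmplt_epi32(cur, prev)))
            return false;               /* some lane is out of order */
    }
    for (; i < n; i++)                  /* scalar remainder */
        if (a[i] < a[i - 1])
            return false;
    return true;
}
```

This is the same block-at-a-time early exit as the unrolling idea discussed in the comments, just expressed with explicit vector instructions instead of relying on the compiler.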
Compiler Explorer link: https://godbolt.org/z/c3GK5do6T