Skip to content

Early exit optimization of Fortran array expressions #129812

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ivan-pi opened this issue Mar 5, 2025 · 6 comments
Open

Early exit optimization of Fortran array expressions #129812

ivan-pi opened this issue Mar 5, 2025 · 6 comments
Labels

Comments

@ivan-pi
Copy link

ivan-pi commented Mar 5, 2025

Consider a function for checking if an array is sorted:

!
! Check an array of integers is sorted in ascending order
!
logical function is_sorted_scalar(n,a) result(is_sorted)
    integer, intent(in) :: n
    integer, intent(in) :: a(n)
    integer :: i
    !$omp simd simdlen(8) early_exit
    do i = 2, n
        if (a(i) < a(i-1)) then
            is_sorted = .false.
            return
        end if
    end do
    is_sorted = .true.
end function

logical function is_sorted_all(n,a) result(is_sorted)
    integer, intent(in) :: n
    integer, intent(in) :: a(n)
    is_sorted = all(a(2:n) >= a(1:n-1))
end function

program benchmark

    implicit none
    integer, allocatable :: a(:)

    integer :: i, n

    external :: is_sorted_scalar
    external :: is_sorted_all

    logical :: is_sorted_scalar
    logical :: is_sorted_all

    character(len=32) :: str
    integer :: tmp

    tmp = 0
    n = 20000

    if (command_argument_count() > 0) then
        call get_command_argument(1,str)
        read(str,*) tmp
        if (tmp > 0) n = tmp
    end if
    print *, "n = ",  n

    allocate(a(n))

    ! Fill ascending numbers
    do i = 1, n
        a(i) = i
    end do

    ! Introduce an unsorted value
    a(100) = 1001
    !a(101) = 1000

    call measure(100000,a,is_sorted_scalar,"scalar")
    call measure(100000,a,is_sorted_all,   "all")

contains

    impure subroutine measure(nreps,a,func,name)
        integer, intent(in) :: nreps
        integer, intent(in) :: a(:)
        logical :: func
        character(len=*), intent(in) :: name
        integer(8) :: t1, t2, rate
        real(kind(1.0d0)) :: elapsed
        logical :: res

        character(len=12) :: str

        integer :: k
        call system_clock(t1)
        do k = 1, nreps
            res = func(size(a),a)
        end do
        call system_clock(t2,rate)

        elapsed = (t2 - t1)/real(rate,kind(elapsed))

        str = adjustl(name)
        print '(A12,F12.4,L2)', str, elapsed/nreps*1.e6, res

        ! Time is in microseconds

    end subroutine

end program

It appears like the is_sorted_all function in flang generates a temporary array for the a(2:n) >= a(1:n-1) expression, and then performs the all reduction. This is fast due to vectorization, but it missed the chance of early exit.

The effect is noticeable in the runtime:

~/fortran/is_sorted$ make FC=flang-new FFLAGS="-O2 -march=native" standalone
flang-new -O2 -march=native -o standalone standalone.f90
~/fortran/is_sorted$ ./standalone 
 n =  20000
scalar            0.0673 F
all               1.7358 F
~/fortran/is_sorted$ make clean
rm -rf *.o benchmark standalone
~/fortran/is_sorted$ make FC=gfortran FFLAGS="-O2 -march=native" standalone
gfortran -O2 -march=native -o standalone standalone.f90
~/fortran/is_sorted$ ./standalone 
 n =        20000
scalar            0.0389 F
all               0.0390 F

It would be nice if early exit vectorization were also supported (https://discourse.llvm.org/t/rfc-supporting-more-early-exit-loops/84690). With x86 SIMD extensions this still has to be done manually it seems: http://0x80.pl/notesen/2018-04-11-simd-is-sorted.html

Compiler Explorer link: https://godbolt.org/z/c3GK5do6T

@llvmbot
Copy link
Member

llvmbot commented Mar 5, 2025

@llvm/issue-subscribers-flang-ir

Author: Ivan Pribec (ivan-pi)

Consider a function for checking if an array is sorted: ```fortran ! ! Check an array of integers is sorted in ascending order ! logical function is_sorted_scalar(n,a) result(is_sorted) integer, intent(in) :: n integer, intent(in) :: a(n) integer :: i !$omp simd simdlen(8) early_exit do i = 2, n if (a(i) < a(i-1)) then is_sorted = .false. return end if end do is_sorted = .true. end function

logical function is_sorted_all(n,a) result(is_sorted)
integer, intent(in) :: n
integer, intent(in) :: a(n)
is_sorted = all(a(2:n) >= a(1:n-1))
end function

program benchmark

implicit none
integer, allocatable :: a(:)

integer :: i, n

external :: is_sorted_scalar
external :: is_sorted_all

logical :: is_sorted_scalar
logical :: is_sorted_all

character(len=32) :: str
integer :: tmp

tmp = 0
n = 20000

if (command_argument_count() &gt; 0) then
    call get_command_argument(1,str)
    read(str,*) tmp
    if (tmp &gt; 0) n = tmp
end if
print *, "n = ",  n

allocate(a(n))

! Fill ascending numbers
do i = 1, n
    a(i) = i
end do

! Introduce an unsorted value
a(100) = 1001
!a(101) = 1000

call measure(100000,a,is_sorted_scalar,"scalar")
call measure(100000,a,is_sorted_all,   "all")

contains

impure subroutine measure(nreps,a,func,name)
    integer, intent(in) :: nreps
    integer, intent(in) :: a(:)
    logical :: func
    character(len=*), intent(in) :: name
    integer(8) :: t1, t2, rate
    real(kind(1.0d0)) :: elapsed
    logical :: res

    character(len=12) :: str

    integer :: k
    call system_clock(t1)
    do k = 1, nreps
        res = func(size(a),a)
    end do
    call system_clock(t2,rate)

    elapsed = (t2 - t1)/real(rate,kind(elapsed))

    str = adjustl(name)
    print '(A12,F12.4,L2)', str, elapsed/nreps*1.e6, res

    ! Time is in microseconds

end subroutine

end program


It appears to me that in `is_sorted_all` flang generates a temporary array for the `a(2:n) &gt;= a(1:n-1)` expression, and then performs the `all` reduction. This is fast due to vectorization, but it missed the chance of early exit. 

The effect is noticeable in the runtime:

~/fortran/is_sorted$ make FC=flang-new FFLAGS="-O2 -march=native" standalone
flang-new -O2 -march=native -o standalone standalone.f90
~/fortran/is_sorted$ ./standalone
n = 20000
scalar 0.0673 F
all 1.7358 F
~/fortran/is_sorted$ make clean
rm -rf *.o benchmark standalone
~/fortran/is_sorted$ make FC=gfortran FFLAGS="-O2 -march=native" standalone
gfortran -O2 -march=native -o standalone standalone.f90
~/fortran/is_sorted$ ./standalone
n = 20000
scalar 0.0389 F
all 0.0390 F


It would be nice if early exit vectorization were also supported (https://discourse.llvm.org/t/rfc-supporting-more-early-exit-loops/84690). With x86 SIMD extensions this still has to be done manually it seems: http://0x80.pl/notesen/2018-04-11-simd-is-sorted.html
</details>

@jeanPerier
Copy link
Contributor

jeanPerier commented Mar 5, 2025

At O1 and above, flang does not create an array temporary for a(2:n) >= a(1:n-1), instead, it inlines the element computation of this array expression inside the all loop.

Early exit is not always better due to its lack of vectorization as you mentioned, it is very dependent of the . Your reproducer inserted the unsorted value at the beginning of the array, so yes, the early exit is much better, but if you place it at the end (e.g as the last element of the array), the numbers are reversed:

$ gfortran -O3 -march=native -o standalone standalone.f90
$ ./standalone
n =        20000
scalar            6.1321 F
all               6.0269 F

$ flang -O3 -march=native -o standalone standalone.f90
$ ./standalone
n =        20000
scalar           12.0968 F
all               1.4955 F

Doing quick maths, the micro benchmark with the unsorted element at the last positions shows the early exit version is currently at best 4x (6. / 1.5) more expensive per iteration, but will do only N/2 iterations on average for unsorted containers (assuming random distribution in all the arrays to be sorted out there) while the flang implementation always does N iterations.

So "on average", the flang implementation is twice as fast than the early exit implementation (N vs 4*N/2), and when I modify your benchmark to do the arithmetic average of runs moving the unordered element from 1 to n, that is pretty much what I see (I actually even see a 4x difference, measuring an average of 6s, just like with the scalar version of flang, Which tells me gfortran may have done extra "unexpected" optimizations in the original micro benchmark).

Also, I think the reduction loop without early exit is easier to parallelize, also I am not an expert here, and we could anyway generate different code for sequential and parallel.

So I tend to think that without further optimization of early exit loops in LLVM, flang implementation is the right simple one (I am sure they are advanced implementations that would do better, using a library would however force the materialization of a temporary for the argument here, which is probably worse).

@ivan-pi
Copy link
Author

ivan-pi commented Mar 5, 2025

Thanks for the more thorough analysis changing the position of the unsorted element and the rest of the explanation. I'm happy to learn no temporary is generated. It's possible I got confused by viewing an -O0 assembly listing or was maybe using a slightly older flang binary installed using Homebrew. I prepared this more than two months ago.

I read somewhere that both Arm SVE and RISC-V have features which allow vectorization of such loops. However, !$omp simd doesn't support this, the !$omp simd simdlen(8) early_exit above is using an obsolescent Intel extension (it works with the older ifort compiler).

@jeanPerier
Copy link
Contributor

It's possible I got confused by viewing an -O0 assembly listing or was maybe using a slightly older flang binary installed using Homebrew. I prepared this more than two months ago.

That is very likely, there is some work going on to try to get rid of array temporaries at O1 and above (you may still saw some array temporaries for some other intrinsic array arguments where other compiler may inline and get rid of the temporary).

Out of curiosity, did you see a slow down in an application/benchmark with flang implementation of all, or was your finding purely coming from micro benchmarks?

I am asking because I have not seen any applications yet where flang is noticeably slower because of the ALL implementation choice, but if an application has "biased" unordered arrays, the flang implementation will indeed be slower (just like applications "biased" in the other direction with "late unordered" elements would favor flang).

I read somewhere that both Arm SVE and RISC-V have features which allow vectorization of such loops.

I am not familiar with this, but it would be great if LLVM could leverage that!

@tblah
Copy link
Contributor

tblah commented Mar 5, 2025

I am not familiar with this, but it would be great if LLVM could leverage that!

There is a bit of work for vectorising early exit loops with SVE, e.g. #128880. I am not an expert but as I understand it, it is quite complex to determine whether this sort of thing will actually be profitable as some current generation cores are slower for some of the more exotic SVE features.

@jeanPerier
Copy link
Contributor

After more thinking, there is something we could do on the flang side to improve the sequential performance: we can unroll before generating an early exit.

I tested manually with a naïve (without proper post loop and size check to enter the unroll loop, so only valid for n that are multiple of 8):

logical function is_sorted_scalar_unroll_8(n,a) result(is_sorted)
    integer, intent(in) :: n
    integer, intent(in) :: a(n)
    integer(8) :: i
    do i = 2, n-7,8
        if (a(i) < a(i-1).or.a(i+1) < a(i).or.a(i+2) < a(i+1).or. a(i+3)<a(i+2).or.a(i+4) < a(i+3).or.a(i+5) < a(i+4).or.a(i+6) < a(i+5).or. a(i+7)<a(i+6)) then
            is_sorted = .false.
            return
        end if
    end do
    is_sorted = .true.
end function

With flang pipeline, it got me to 2.6 average vs 1.5 for the current all implementation. Looking at the assembly, it was still not vectorized, but running opt -O3 on the LLVM IR for is_sorted_scalar_unroll_8_, I was able to get a vectorized version that got me to an average of 1.0, so a 30% improvement (still a bit slower if all elements are iterated over (2.0 vs 1.5)).

LLVM does not have the liberty flang has to unroll the loop like I did because it does not know it is OK to "over-evaluate" after the early exit. Flang could do it as long as it knows this is a reduction and we can evaluate as many iteration we need/want.

So we can do better, but this opens a door to something we have not dwell-into currently in flang: target dependent loop optimizations (to find the optimal unrolling factor). So far we happily delegate that to LLVM.

All that to say, we could do something on the flang side, but that optimization would not be my priority without an actual application where the all/any reduction speedup would be significant overall.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants