SIMD backend #2310

Open
Tracked by #2258
certik opened this issue Aug 30, 2023 · 4 comments

Comments

@certik
Contributor

certik commented Aug 30, 2023

diff --git a/src/libasr/ASR.asdl b/src/libasr/ASR.asdl
index 26e60e172..d6a29ecef 100644
--- a/src/libasr/ASR.asdl
+++ b/src/libasr/ASR.asdl
@@ -420,6 +420,7 @@ array_physical_type
     = DescriptorArray
     | PointerToDataArray
     | FixedSizeArray
+    | SIMDArray
     | NumPyArray
     | ISODescriptorArray

We'll use Annotated:

from typing import Annotated
from lpython import f32, SIMD
x: Annotated[f32[64], SIMD]

In ASR we use the SIMDArray physical type, and then in the LLVM backend (or an ASR->ASR pass) we ensure all such arrays get vectorized; otherwise we give a compile-time error message. The conditions are:

  • Must be a 1D array
  • Array element type and all operations must map directly onto hardware instructions. For example, on the Apple M1 CPU, f32 add and multiply are supported (it will compile), but f16 multiply is not (compile-time error on Apple M1)
  • Fixed compile-time size
  • The size must be a multiple of the hardware vector length. If we have 512 bits (AVX-512), then sizeof(element) * size must be equal to 512 * n for n = 1, 2, 3, .... If n = 1, the array is stored directly in a register. For n > 1, the loop is unrolled (since the size is known at compile time), ensuring we hit maximum compute throughput (hiding IO and latency); see the sketch after this list.
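
To make the size and unrolling conditions concrete, here is a rough sketch in plain NumPy (an illustration only; VECTOR_BITS, LANES and the explicit chunking are assumptions standing in for what the backend would do, not LPython/ASR API):

import numpy as np

VECTOR_BITS = 512              # e.g. AVX-512
LANES = VECTOR_BITS // 32      # 16 f32 lanes per hardware vector

def simd_add(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # The rules above require a 1D array whose total bit size is a multiple
    # of the vector length, i.e. x.size == LANES * n for some n >= 1.
    n, rem = divmod(x.size, LANES)
    assert x.ndim == 1 and x.dtype == np.float32 and rem == 0
    out = np.empty_like(x)
    # Conceptually the backend emits n vector adds; with n known at compile
    # time the loop is fully unrolled, and for n == 1 the whole array lives
    # in a single register.
    for k in range(n):
        s = slice(k * LANES, (k + 1) * LANES)
        out[s] = x[s] + y[s]   # one hardware vector add per chunk
    return out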
@Shaikh-Ubaid
Collaborator

What is the priority for this issue?

@certik
Contributor Author

certik commented Aug 31, 2023

I added this issue here: #2258, but the relative priority is not clear yet. All the issues there are important.

@czgdp1807
Collaborator

This seems interesting. I would like to work on this along with the other things on my bucket list.

@certik
Contributor Author

certik commented Sep 8, 2023

Ok, here is an example of a vectorized Mandelbrot that we should try to compile using LPython to get maximum performance.

import numpy as np

MAX_ITERS = 100

# c: Annotated[c64[:], SIMD]
def mandelbrot_kernel2(c):
    z = np.empty(c.shape, dtype=np.complex128)
    z[:] = c[:]
    nv = np.zeros(c.shape, dtype=np.int8)
    # True if the point is in set, False otherwise
    mask = np.empty(c.shape, dtype=np.bool_)
    for i in range(MAX_ITERS):
        mask[:] = (abs(z) <= 2)
        if (all(mask == False)): break
        # z = z*z + c, applied only to points that have not escaped yet
        z[mask] *= z[mask]
        z[mask] += c[mask]
        nv[mask] += 1
    return nv

n = 8
height = 4096 // n
width = 4096 // n
min_x = -2.0
max_x = 0.47
min_y = -1.12
max_y = 1.12
scale_x = (max_x - min_x) / width
scale_y = (max_y - min_y) / height
simd_width = 512
assert simd_width <= width

output = np.empty((height,width), dtype=np.int8)

x = np.empty((simd_width), dtype=np.complex128)
for h in range(height):
    cy = min_y + h * scale_y
    # process the row in chunks of simd_width points
    for w0 in range(width // simd_width):
        w = np.arange(w0*simd_width, (w0+1)*simd_width, dtype=np.int32)
        cx = min_x + w * scale_x
        x[:] = cx + 1j*cy
        output[h,w] = mandelbrot_kernel2(x)

print(output)
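
For comparison, here is a plain scalar version of the same iteration (a sketch, not from the issue) that the vectorized kernel can be checked against point by point, e.g. nv[j] == mandelbrot_scalar(c[j]):

def mandelbrot_scalar(c):
    # per-point equivalent of mandelbrot_kernel2: iterate z = z*z + c and
    # count how many iterations stay within |z| <= 2
    z = c
    n = 0
    for _ in range(MAX_ITERS):
        if abs(z) > 2:
            break
        z = z*z + c
        n += 1
    return n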
