|
| 1 | +# Intel "Explicit SIMD" SYCL extension |
| 2 | + |
| 3 | +OneAPI provides the "Explicit SIMD" SYCL extension (or simply "ESIMD") for |
| 4 | +lower-level Intel GPU programming. It provides APIs closely matching Intel GPU ISA |
| 5 | +yet allows to write explicitly vectorized device code. This helps programmer to |
| 6 | +have more control over the generated code and depend less on compiler |
| 7 | +optimizations. The [specification](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/ExplicitSIMD/dpcpp-explicit-simd.md), |
| 8 | +[documented ESIMD APIs headers](https://github.com/intel/llvm/tree/sycl/sycl/include/CL/sycl/INTEL/esimd) and |
| 9 | +[working code examples](https://github.com/intel/llvm/blob/sycl/sycl/test/esimd/on-device) are available on the Intel DPC++ project's github. |
| 10 | + |
| 11 | +**_NOTE:_** _This extension is under active development and lots of APIs are |
| 12 | +subject to change. There are currenly a number of restrictions specified |
| 13 | +below._ |
| 14 | + |
| 15 | +ESIMD kernels and functions always require the subgroup size of one, which means |
| 16 | +compiler never does vectorization across workitems in a subgroup. Instead, |
| 17 | +vectorization is experessed explicitly in the code by the programmer. Here is a |
| 18 | +trivial example which adds elements of two arrays and writes the results to the |
| 19 | +third: |
| 20 | + |
| 21 | +```cpp |
| 22 | + float *A = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt)); |
| 23 | + float *B = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt)); |
| 24 | + float *C = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt)); |
| 25 | + |
| 26 | + for (unsigned i = 0; i < Size; ++i) { |
| 27 | + A[i] = B[i] = i; |
| 28 | + } |
| 29 | + |
| 30 | + // We need that many workitems. Each processes VL elements of data. |
| 31 | + cl::sycl::range<1> GlobalRange{Size / VL}; |
| 32 | + // Number of workitems in each workgroup. |
| 33 | + cl::sycl::range<1> LocalRange{GroupSize}; |
| 34 | + |
| 35 | + cl::sycl::nd_range<1> Range(GlobalRange, LocalRange); |
| 36 | + |
| 37 | + auto e = q.submit([&](handler &cgh) { |
| 38 | + cgh.parallel_for<class Test>(Range, [=](nd_item<1> ndi) SYCL_ESIMD_KERNEL { |
| 39 | + using namespace sycl::INTEL::gpu; |
| 40 | + |
| 41 | + int i = ndi.get_global_id(0); |
| 42 | + simd<float, VL> va = block_load<float, VL>(A + i * VL); |
| 43 | + simd<float, VL> vb = block_load<float, VL>(B + i * VL); |
| 44 | + simd<float, VL> vc = va + vb; |
| 45 | + block_store<float, VL>(C + i * VL, vc); |
| 46 | + }); |
| 47 | + }); |
| 48 | +``` |
| 49 | +
|
| 50 | +In this example the lambda function passed to the `parallel_for` is marked with |
| 51 | +a special attribute - `SYCL_ESIMD_KERNEL`. This tells the compiler that this |
| 52 | +kernel is a ESIMD one and ESIMD APIs can be used inside it. Here the `simd` |
| 53 | +objects and `block_load`/`block_store` intrinsics are used which are avaiable |
| 54 | +only in the ESIMD extension. |
| 55 | +Full runnable code sample can be found on the |
| 56 | +[github repo](https://github.com/intel/llvm/blob/sycl/sycl/test/esimd/on-device/vadd_usm.cpp). |
| 57 | +
|
| 58 | +#### Compiling and running ESIMD code. |
| 59 | +To compile a code which uses the ESIMD extension, a special compiler switch |
| 60 | +`-fsycl-explicit-simd` switch must be used: |
| 61 | +
|
| 62 | +> `$ clang++ -fsycl -fsycl-explicit-simd vadd_usm.cpp` |
| 63 | +
|
| 64 | +The resulting executable can only be run on Intel GPU hardware, such as |
| 65 | +Intel HD Graphics 600 or later. To run it, couple additional environment |
| 66 | +variables must be used - `SYCL_BE=PI_OPENCL` and |
| 67 | +`SYCL_PROGRAM_COMPILE_OPTIONS=-vc-codegen`: |
| 68 | +
|
| 69 | +> `$ SYCL_BE=PI_OPENCL SYCL_PROGRAM_COMPILE_OPTIONS=-vc-codegen ./a.out` |
| 70 | +
|
| 71 | +#### Restrictions |
| 72 | +
|
| 73 | +Here is a list of main restrictions imposed on using ESIMD extension. Note that |
| 74 | +some of them are not enforced by the compiler, which may lead to undefined |
| 75 | +program behavior if violated. |
| 76 | +
|
| 77 | +##### Features not supported with ESIMD extension: |
| 78 | +- Windows target |
| 79 | +- Ahead-of-time compilation |
| 80 | +- The [C and C++ Standard libraries support](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/C-CXX-StandardLibrary/C-CXX-StandardLibrary.rst) |
| 81 | +- The [Device library extensions](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/C-CXX-StandardLibrary/DeviceLibExtensions.rst) |
| 82 | +- Host device (in some cases) |
| 83 | +
|
| 84 | +##### Unsupported standard SYCL APIs: |
| 85 | +- Local accessors |
| 86 | +- Most of image APIs |
| 87 | +- Specialization constants |
| 88 | +- Memory access through a raw pointer returned by `sycl::accessor::get_pointer()` |
| 89 | +
|
| 90 | +##### Other restrictions: |
| 91 | +- Only Intel GPU device is supported |
| 92 | +- Usual and ESIMD DPC++ kernels can not co-exist in the same application in most |
| 93 | + cases |
0 commit comments