Skip to content

Commit c36a141

Browse files
authored
[SYCL][ESIMD] Add ESIMD extension intro document. (#2502)
Signed-off-by: Konstantin S Bobrovsky <[email protected]>
1 parent 659a6d0 commit c36a141

File tree

1 file changed

+93
-0
lines changed

1 file changed

+93
-0
lines changed
+93
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,93 @@
1+
# Intel "Explicit SIMD" SYCL extension
2+
3+
OneAPI provides the "Explicit SIMD" SYCL extension (or simply "ESIMD") for
4+
lower-level Intel GPU programming. It provides APIs closely matching Intel GPU ISA
5+
yet allows to write explicitly vectorized device code. This helps programmer to
6+
have more control over the generated code and depend less on compiler
7+
optimizations. The [specification](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/ExplicitSIMD/dpcpp-explicit-simd.md),
8+
[documented ESIMD APIs headers](https://github.com/intel/llvm/tree/sycl/sycl/include/CL/sycl/INTEL/esimd) and
9+
[working code examples](https://github.com/intel/llvm/blob/sycl/sycl/test/esimd/on-device) are available on the Intel DPC++ project's github.
10+
11+
**_NOTE:_** _This extension is under active development and lots of APIs are
12+
subject to change. There are currenly a number of restrictions specified
13+
below._
14+
15+
ESIMD kernels and functions always require the subgroup size of one, which means
16+
compiler never does vectorization across workitems in a subgroup. Instead,
17+
vectorization is experessed explicitly in the code by the programmer. Here is a
18+
trivial example which adds elements of two arrays and writes the results to the
19+
third:
20+
21+
```cpp
22+
float *A = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt));
23+
float *B = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt));
24+
float *C = static_cast<float *>(malloc_shared(Size * sizeof(float), dev, ctxt));
25+
26+
for (unsigned i = 0; i < Size; ++i) {
27+
A[i] = B[i] = i;
28+
}
29+
30+
// We need that many workitems. Each processes VL elements of data.
31+
cl::sycl::range<1> GlobalRange{Size / VL};
32+
// Number of workitems in each workgroup.
33+
cl::sycl::range<1> LocalRange{GroupSize};
34+
35+
cl::sycl::nd_range<1> Range(GlobalRange, LocalRange);
36+
37+
auto e = q.submit([&](handler &cgh) {
38+
cgh.parallel_for<class Test>(Range, [=](nd_item<1> ndi) SYCL_ESIMD_KERNEL {
39+
using namespace sycl::INTEL::gpu;
40+
41+
int i = ndi.get_global_id(0);
42+
simd<float, VL> va = block_load<float, VL>(A + i * VL);
43+
simd<float, VL> vb = block_load<float, VL>(B + i * VL);
44+
simd<float, VL> vc = va + vb;
45+
block_store<float, VL>(C + i * VL, vc);
46+
});
47+
});
48+
```
49+
50+
In this example the lambda function passed to the `parallel_for` is marked with
51+
a special attribute - `SYCL_ESIMD_KERNEL`. This tells the compiler that this
52+
kernel is a ESIMD one and ESIMD APIs can be used inside it. Here the `simd`
53+
objects and `block_load`/`block_store` intrinsics are used which are avaiable
54+
only in the ESIMD extension.
55+
Full runnable code sample can be found on the
56+
[github repo](https://github.com/intel/llvm/blob/sycl/sycl/test/esimd/on-device/vadd_usm.cpp).
57+
58+
#### Compiling and running ESIMD code.
59+
To compile a code which uses the ESIMD extension, a special compiler switch
60+
`-fsycl-explicit-simd` switch must be used:
61+
62+
> `$ clang++ -fsycl -fsycl-explicit-simd vadd_usm.cpp`
63+
64+
The resulting executable can only be run on Intel GPU hardware, such as
65+
Intel HD Graphics 600 or later. To run it, couple additional environment
66+
variables must be used - `SYCL_BE=PI_OPENCL` and
67+
`SYCL_PROGRAM_COMPILE_OPTIONS=-vc-codegen`:
68+
69+
> `$ SYCL_BE=PI_OPENCL SYCL_PROGRAM_COMPILE_OPTIONS=-vc-codegen ./a.out`
70+
71+
#### Restrictions
72+
73+
Here is a list of main restrictions imposed on using ESIMD extension. Note that
74+
some of them are not enforced by the compiler, which may lead to undefined
75+
program behavior if violated.
76+
77+
##### Features not supported with ESIMD extension:
78+
- Windows target
79+
- Ahead-of-time compilation
80+
- The [C and C++ Standard libraries support](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/C-CXX-StandardLibrary/C-CXX-StandardLibrary.rst)
81+
- The [Device library extensions](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/C-CXX-StandardLibrary/DeviceLibExtensions.rst)
82+
- Host device (in some cases)
83+
84+
##### Unsupported standard SYCL APIs:
85+
- Local accessors
86+
- Most of image APIs
87+
- Specialization constants
88+
- Memory access through a raw pointer returned by `sycl::accessor::get_pointer()`
89+
90+
##### Other restrictions:
91+
- Only Intel GPU device is supported
92+
- Usual and ESIMD DPC++ kernels can not co-exist in the same application in most
93+
cases

0 commit comments

Comments
 (0)