Skip to content

[SYCL] [DOC] Prepare design-document for assert feature #3461

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 49 commits into from
May 31, 2021
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
49 commits
Select commit Hold shift + click to select a range
2911ea7
[SYCL] [DOC] Prepare design-document for assert feature
Mar 31, 2021
b69a1cd
Remove redundant file
Mar 31, 2021
15ea88e
Fix typo
Apr 1, 2021
ca08fec
Address some review comments. Add description of built-ins.
Apr 5, 2021
1f8d9a9
Fix links
Apr 5, 2021
2ee590c
Clarify that assertion failure message is printed by DPCPP Runtime
Apr 5, 2021
77699a2
Clarify that fallback assert impl is synchronous
Apr 6, 2021
001a573
Fix typo in level-zero ext draft
Apr 6, 2021
32b6479
Address some review comments.
Apr 7, 2021
b8637c2
Add exception extension
Apr 8, 2021
b0cd85f
Use error-code instead of distinct exception.
Apr 8, 2021
8c03648
[SYCL] Add OpenCL extension for assert error code
Apr 9, 2021
121c945
[SYCL] Add Level-Zero extension for assert error code
Apr 9, 2021
13b40fd
Merge branch 'private/s-kanaev/assert-ocl-l0' into private/s-kanaev/a…
Apr 9, 2021
a4b4884
Remove draft files
Apr 9, 2021
c06db5f
Remove unwanted part
Apr 9, 2021
823124a
Merge branch 'private/s-kanaev/assert-ocl-l0' into private/s-kanaev/a…
Apr 9, 2021
a99368b
Add limitations on submit to same queue after exception thrown.
Apr 9, 2021
78d7fcb
Add format of assert message
Apr 9, 2021
6882e95
Clarify where kernel wrapping takes place
Apr 9, 2021
32663e0
Changes to SYCL specification
Apr 13, 2021
2b84a83
Elaborate on limitations
Apr 13, 2021
423107b
Fix link
Apr 14, 2021
7611511
Add sequence describing how DPCPP RT gets to know about assert failure
Apr 14, 2021
a31b808
Add notes on property set usage
Apr 14, 2021
257054a
Address comments
Apr 14, 2021
3f50173
Fix typo and format note
Apr 14, 2021
c1326aa
Fix typo
Apr 14, 2021
5095b1a
Add extension to README
Apr 14, 2021
5078fcc
Note on how property set gets to be set
Apr 14, 2021
4dc7b1f
Merge branch 'sycl' into private/s-kanaev/assert-abort
Apr 15, 2021
9bcac02
Partially remove mentioning of async exception throw
Apr 15, 2021
7ec3ac8
Add Assert.md to index
Apr 15, 2021
8cbfde7
Remove the rest of exception throws
Apr 15, 2021
cc085f5
Address review comments
Apr 22, 2021
8835bf8
Document program-scope variable approach
May 6, 2021
8835756
Merge remote-tracking branch 'public/sycl' into private/s-kanaev/asse…
May 6, 2021
ecb8659
Remove L0 and OCL extensions.
May 7, 2021
07debdb
Address comments
May 11, 2021
995e4d8
Fix typo
May 12, 2021
b57ac48
Fix typo
May 12, 2021
d2f13ff
Address review comments
May 17, 2021
6281bc5
Switch to __devicelib_assert_read
May 19, 2021
a5461f3
Remove use of NDEBUG from suggested changes
May 19, 2021
32a32f4
Reorder text to increase readability
May 19, 2021
641d071
Address review comment
May 20, 2021
dc058a9
Address review comments
May 27, 2021
16fd8f0
Add aspect
May 27, 2021
fbca768
Update extension with suggestion
s-kanaev May 27, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
144 changes: 144 additions & 0 deletions sycl/doc/Assert.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
# Assert feature

**IMPORTANT**: This document is a draft.

During debugging of kernel code user may put assertions here and there.
The expected behaviour of assertion failure at host is application abort.
Our choice for device-side assertions is asynchronous exception in order to
allow for extensibility.

The user is free to disable assertions by defining `NDEBUG` macro at
compile-time.


## Use-case example

```
using namespace cl::sycl;
auto ErrorHandler = [] (exception_list Exs) {
for (exception_ptr const& E : Exs) {
try {
std::rethrow_exception(E);
}
catch (event_error const& Ex) {
std::cout << “Exception - ” << Ex.what(); // assertion failed
std::abort();
}
}
};

void user_func(item<2> Item) {
assert((Item[0] % 2) && “Nil”);
}

int main() {
queue Q(ErrorHandler);
q.submit([&] (handler& CGH) {
CGH.parallel_for<class TheKernel>(range<2>{N, M}, [=](item<2> It) {
do_smth();
user_func(It);
do_smth_else();
});
});
Q.wait_and_throw();
std::cout << “One shouldn’t see this message.“;
return 0;
}
```

In this use-case every work-item with even X dimension will trigger assertion
failure. Assertion failure should be reported via asynchronous exceptions. If
asynchronous exception handler is set the failure is reported with
`cl::sycl::event_error` exception. Otherwise, SYCL Runtime should trigger abort.
At least one failed assertion should be reported.

When multiple kernels are enqueued and both fail at assertion at least single
assertion should be reported.

## User requirements

From user's point of view there are the following requirements:

| # | Title | Description | Importance |
| - | ----- | ----------- | ---------- |
| 1 | Handle assertion failure | Signal about assertion failure via SYCL asynchronous exception | Must have |
| 2 | Print assert message | Assert function should print message to stderr at host | Must have |
| 3 | Stop under debugger | When debugger is attached, break at assertion point | Highly desired |
| 4 | Reliability | Assert failure should be reported regardless of kernel deadlock | Highly desired |

## Contents of `cl::sycl::event_error`

`cl::sycl::event_error::what()` should return the same assertion failure message
as is printed at the time being.

Other than that, interface of `cl::sycl::event_error` should look like:
```
class event_error : public runtime_error {
public:
event_error() = default;

event_error(const char *Msg, cl_int Err)
: event_error(string_class(Msg), Err) {}

event_error(const string_class &Msg, cl_int Err) : runtime_error(Msg, Err) {}

/// Returns global ID with the dimension provided
int globalId(int Dim) const;

/// Returns local ID with the dimension provided
int localId(int Dim) const;
};
```

Regardless of whether asynchronous exception handler is set or not, there's an
action to be performed by SYCL Runtime. To achieve this, information about
assert failure should be propagated from device-side to SYCL Runtime. This
should be performed via calls to `clGetEventInfo` for OpenCL backend and
`zeEventQueryStatus` for Level-Zero backend.

## Terms

- Device-side Runtime - part of device-code, which is supplied by Device-side
Compiler.
- Low-level Runtime - the backend/runtime, behind DPCPP Runtime.
- Device-side Compiler - compiler which generates device-native bitcode based
on input SPIR-V image.
- Accessor metadata - parts of accessor representation at device-side: pointer,
ranges, offset.

## How it works?

For the time being, `assert(expr)` macro ends up in call to
`__devicelib_assert_fail` function. This function is part of [Device library extension](doc/extensions/C-CXX-StandardLibrary/DeviceLibExtensions.rst#cl_intel_devicelib_cassert).
Device code already contains call to the function. Currently, a device-binary
is always linked against fallback implementation.
Device-side compiler/linker provides their implementation of `__devicelib_assert_fail`
and prefer this implementation over fallback one.

If Device-side Runtime supports `__devicelib_assert_fail` then Low-Level Runtime
is responsible for:
- detecting if assert failure took place;
- flushing assert message to `stderr` on host.
When detected, Low-level Runtime reports assert failure to DPCPP Runtime
at synchronization points.

Refer to [OpenCL](doc/extensions/Assert/opencl.md) and [Level-Zero](doc/extensions/Assert/level-zero.md)
extensions.

If Device-side Runtime doesn't support `__devicelib_assert_fail` then a buffer
based approach comes in place. The approach doesn't require any support from
Device-side Runtime. Neither it does from Low-level Runtime.

Within this approach, a dedicated assert buffer is allocated and implicit kernel
argument is introduced. The argument is an accessor with `discard_read_write`
or `discard_write` access mode. Accessor metadata is stored to program scope
variable. This allows to refer to the accessor without modifying each and every
user's function. Fallback implementation of `__devicelib_assert_fail` restores
accessor metadata from program scope variable and writes assert information to
the assert buffer. Atomic operations are used in order to not overwrite existing
information.

Storing and restoring of accessor metadata to/from program scope variable is
performed with help of builtins. Implementations of these builtins are
substituted by frontend.

19 changes: 19 additions & 0 deletions sycl/doc/extensions/Assert/level-zero.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
# Overview

This extension enables detection of assert failure of kernel.

# New enum value

`ze_result_t` enumeration should be augmented with `ZE_RESULT_ABORTED` enum
element. This enum value indicated a detected assert failure at device-side.

# Changed API

```
ze_event_handle_t Event; // describes an event of kernel been submitted previously
ze_result Result = zeEventQueryStatus(Event);
```

If kernel failed an assertion `zeEventQueryStatus` should return

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don;t think this is possible to achieve in asynchronous / non-blocking way in L0.

We dont have any communication between kernel and event - so we can;t signal events with "assert happened" information.

if we use global / program wide assert buffer - each kernel will be using the same assert happened flag - we do not have fine grain control to determine which kernel - and which connected event fired the assert.

Fences could be used - allowing to synchronize at cmdQueue level and not kernel - any kernel causing assert executed in cmd Queue can then make fence synchronize to return error:https://spec.oneapi.com/level-zero/latest/core/PROG.html#fences

Copy link
Contributor Author

@s-kanaev s-kanaev Apr 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still possible in OpenCL?
Can the OpenCL approach be reused in Level-Zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you, please, provide more details about using fences?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fences are decribed in L0 spec - they are similar to events, but directly connected to command queues: https://spec.oneapi.com/level-zero/latest/core/PROG.html#fences

In OpenCL the submission model is different - each enqueue is independent - single kernel is submitted ( queued) at a time. L0 operates on command lists that may contain multiple kernels - once cmd list is submitted to HW - we can;t control when a kernel in whole sequence is started completed.

OpenCL handles kernels with printf in a blocking way - enqueueNDRangeKErnel with printf makes this a blocking call - so we have fine control when specific kernel is completed - we can do the same for assert() message - output event will be created when the kernel has already finished. I L0 this is not possible - as we would have to synchronize whoel command list.

`ZE_RESULT_ABORTED`.

22 changes: 22 additions & 0 deletions sycl/doc/extensions/Assert/opencl.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# Overview

This extension enables detection of assert failure of kernel.

# New error code

`CL_ASSERT_FAILURE` is added to indicate a detected assert failure at
device-side.

# Changed API

```
cl_event Event; // describes an event of kernel been submitted previously
cl_int Result;
size_t ResultSize;

clGetEventInfo(Event, CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(Result), &Result, &ResultSize);
```

If kernel failed an assertion `clGetEventInfo` should put `CL_ASSERT_FAILURE`
in `Result`.