diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc new file mode 100644 index 0000000000000..178dfa9a4315e --- /dev/null +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph.asciidoc @@ -0,0 +1,1666 @@ += sycl_ext_oneapi_graph +:source-highlighter: coderay +:coderay-linenums-mode: table + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en +:sectnums: + +:blank: pass:[ +] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + +== Notice + +Copyright (c) 2022-2023 Intel Corporation. All rights reserved. + +NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are +trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. +used by permission by Khronos. + + +== Contact + +To report problems with this extension, please open a new issue at: + +https://github.com/intel/llvm/issues + +== Contributors + +Pablo Reble, Intel + +Julian Miller, Intel + +John Pennycook, Intel + +Guo Yejun, Intel + +Dan Holmes, Intel + +Greg Lueck, Intel + +Steffen Larsen, Intel + +Jaime Arteaga Molina, Intel + +Ewan Crawford, Codeplay + +Ben Tracy, Codeplay + +Duncan McBain, Codeplay + +Peter Žužek, Codeplay + +Ruyman Reyes, Codeplay + +Gordon Brown, Codeplay + +Erik Tomusk, Codeplay + +Bjoern Knafla, Codeplay + +Lukas Sommer, Codeplay + +Maxime France-Pillois, Codeplay + +Ronan Keryell, AMD + + +== Dependencies + +This extension is written against the SYCL 2020 revision 6 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + +== Status + +This is a proposed extension specification, intended to gather community +feedback. Interfaces defined in this specification may not be implemented yet +or may be in a preliminary state. The specification itself may also change in +incompatible ways before it is finalized. *Shipping software products should +not rely on APIs defined in this specification.* + +== Introduction + +With command groups SYCL is already able to create an implicit dependency +graph (in the form of a directed acyclic graph) of kernel execution at runtime, +as a command group object defines a set of requisites (edges) which must be +satisfied for commands (nodes) to be executed. However, because command-group +submission is tied to execution on the queue, without having a prior +construction step before starting execution, optimization opportunities are +missed from the runtime not being made aware of a defined dependency graph ahead +of execution. + +The following benefits would become possible if the user could define a +dependency graph to the SYCL runtime prior to execution: + +* Reduction in runtime overhead by only submitting a single graph object, rather + than many individual command groups. + +* Enable more work to be done ahead of time to improve runtime performance. This + early work could be done in a setup phase of the program prior to repeated + executions of the graph. Alternately, a future offline AOT compiler in a different + process could be run prior to the execution of the application. + +* Unlock DMA hardware features through graph analysis by the runtime. + +* Graph optimizations become available, including but not limited to: +** Kernel fusion/fission. +** Inter-node memory reuse from data staying resident on device. +** Identification of the peak intermediate output memory requirement, used for + more optimal memory allocation. + +As well as benefits to the SYCL runtime, there are also advantages to the user +developing SYCL applications, as repetitive workloads no longer have to +redundantly issue the same sequence of commands. Instead, a graph is only +constructed once and submitted for execution as many times as is necessary, only +changing the data in input buffers or USM allocations. For applications from +specific domains, such as machine learning, where the same command group pattern +is run repeatedly for different inputs, this is particularly useful. + +=== Requirements + +In order to achieve the goals described in previous sections, the following +requirements were considered: + +1. Ability to update inputs/outputs of the graph between submissions, without + changing the overall graph structure. +2. Enable low effort porting of existing applications to use the extension. +3. Profiling, debugging, and tracing functionality at the granularity of graph + nodes. +4. Integrate sub-graphs (previously constructed graphs) when constructing a new + graph. +5. Support the USM model of memory as well as buffer/accessor model. +6. Compatible with other SYCL extensions and features, e.g. kernel fusion & + built-in kernels. +7. Ability to record a graph with commands submitted to different devices in the + same context. +8. Capability to serialize graphs to a binary format which can then be + de-serialized and executed. This is helpful for AOT cases where a graph + can be created by an offline tool to be loaded and run without the end-user + incurring the overheads of graph creation. +9. Backend interoperability, the ability to retrieve a native graph object from + the graph and use that in a native backend API. + +To allow for prototype implementations of this extension to be developed +quickly for evaluation the scope of this proposal was limited to a subset +of these requirements. In particular, the serialization functionality (8), +backend interoperability (9), and a profiling/debugging interface (3) were +omitted. As these are not easy to abstract over several backends without +significant investigation. It is also hoped these features can be exposed as +additive changes to the API, and thus introduced in future versions of the +extension. + +Another reason for deferring a serialize/deserialize API (8) is that its scope +could extend from emitting the graph in a binary format, to emitting a +standardized IR format that enables further device specific graph optimizations. + +Multi-device support (7) is something that we are considering introducing into +the extension in later revisions, which may result in API changes. It has been +planned for to the extent that the definition of a graph node is device +specific, however currently all nodes in a graph must target the same device +provided to the graph constructor. + +=== Graph Building Mechanisms + +This extension contains two different API mechanisms for constructing a graph +of commands: + +1. **Explicit graph building API** - Allows users to specify the exact nodes +and edges they want to add to the graph. + +2. **Queue recording API (aka "Record & Replay")** - Introduces state to a +`sycl::queue` such that rather than scheduling commands immediately for +execution, they are added to the graph object instead, with edges captured from +the dependencies of the command group. + +Each of these mechanisms for constructing a graph have their own advantages, so +having both APIs available allows the user to pick the one which is most +suitable for them. The queue recording API allows quicker porting of existing +applications, and can capture external work that is submitted to a queue, for +example via library function calls. While the explicit API can better express +what data is internal to the graph for optimization, and dependencies don't need +to be inferred. + +It is valid to combine these two mechanisms, however it is invalid to modify +a graph using the explicit API while that graph is currently recording commands +from any queue, for example: + +[source, c++] +---- +graph.begin_recording(queue); +graph.add(/*command group*/); // Invalid as graph is recording a queue +graph.end_recording(); +---- + +== Specification + +=== Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification section 6.3.3 "Feature test macros". Therefore, an +implementation supporting this extension must predefine the macro +`SYCL_EXT_ONEAPI_GRAPH` to one of the values defined in the table below. +Applications can test for the existence of this macro to determine if the +implementation supports this feature, or applications can test the macro's +value to determine which of the extension's APIs the implementation supports. + +Table {counter: tableNumber}. Values of the `SYCL_EXT_ONEAPI_GRAPH` macro. +[%header,cols="1,5"] +|=== +|Value |Description +|1 |Initial extension version. Base features are supported. +|=== + +=== SYCL Graph Terminology + +Table {counter: tableNumber}. Terminology. +[%header,cols="1,3"] +|=== +| Concept | Description + +| Graph +| A directed and acyclic graph (DAG) of commands (nodes) and their dependencies +(edges), represented by the `command_graph` class. + +| Node +| A command, which can have different attributes, targeting a specific device. + +| Edge +| Dependency between commands as a happens-before relationship. + +|=== + +==== Explicit Graph Building API + +When using the explicit graph building API to construct a graph, nodes and +edges are captured as follows. + +Table {counter: tableNumber}. Explicit Graph Definition. +[%header,cols="1,3"] +|=== +| Concept | Description + +| Node +| In the explicit graph building API nodes are created by the user invoking +methods on a modifiable graph passing a command-group function (CGF). Each node +represents either a command-group or an empty operation. + +| Edge +| In the explicit graph building API edges are primarily defined by the user +through newly added interfaces. This is either using the `make_edge()` function +to define an edge between existing nodes, or using a +`property::node::depends_on` property list when adding a new node to the graph. + +Edges can also be created when explicitly adding nodes to the graph through +existing SYCL mechanisms for expressing dependencies. Data dependencies from +buffer accessors to existing nodes in the graph are captured as an edge. Using +`handler::depends_on()` will also create a graph edge when passed an event +returned from a queue submission captured by a queue recording to the same graph. +|=== + +==== Queue Recording API + +When using the record & replay API to construct a graph by recording a queue, +nodes and edges are captured as follows. + +Table {counter: tableNumber}. Recorded Graph Definition. +[%header,cols="1,3"] +|=== +| Concept | Description + +| Node +| A node in a queue recorded graph represents a command-group submission to the +device associated with the queue being recorded. Nodes are constructed from +the command-group functions (CGF) passed to `queue::submit()`, or from the queue +shortcut equivalents for the defined handler command types. Each submission +encompasses either one or both of a.) some data movement, b.) a single +asynchronous command launch. Nodes cannot define forward edges, only backwards. +That is, nodes can only create dependencies on command-groups that have already +been submitted. + +| Edge +| An edge in a queue recorded graph is expressed through command group +dependencies in one of two ways. Firstly, through buffer accessors that +represent data dependencies between two command groups captured as nodes. +Secondly, by using the `handler::depends_on()` mechanism inside a command group +captured as a node. However, for an event passed to `handler::depends_on()` to +create an edge, it must be an event returned from a queue +submission captured by the same graph. Otherwise, a synchronous error will be +thrown with error code `invalid`. `handler::depends_on()` can be +used to express edges when a user is working with USM memory rather than SYCL +buffers. +|=== + +==== Sub-Graph + +A node in a graph can take the form of a nested sub-graph. This occurs when +a command-group submission that invokes `handler::ext_oneapi_graph()` with an +executable graph object is added to the graph as a node. The child graph node is +scheduled in the parent graph as-if edges are created to connect the root nodes +of the child graph with the dependent nodes of the parent graph. + +Adding an executable graph as a sub-graph does not affect its existing node +dependencies, such that it could be submitted in future without any side +effects of prior uses as a sub-graph. + +=== API Modifications + +[source, c++] +---- +namespace sycl { +namespace ext::oneapi::experimental { + +// State of a queue, returned by queue::ext_oneapi_get_state() +enum class queue_state { + executing, + recording +}; + +namespace property { + +namespace graph { + +class no_cycle_check { + public: + no_cycle_check() = default; +}; + +class no_host_copy { +public: + no_host_copy() = default; +}; +} // namespace graph + +namespace node { + +class depends_on { + public: + template + depends_on(NodeTN... nodes); +}; + +} // namespace node +} // namespace property + +// Device query for level of support +namespace info { +namespace device { + +struct graphs_support; + +} // namespace device + +enum class graph_support_level { + unsupported, + native, + emulated +}; +} // namespace info + +class node {}; + +// State of a graph +enum class graph_state { + modifiable, + executable +}; + +// New object representing graph +template +class command_graph {}; + +template<> +class command_graph { +public: + command_graph(const context& syclContext, const device& syclDevice, + const property_list& propList = {}); + + command_graph + finalize(const property_list& propList = {}) const; + + bool begin_recording(queue& recordingQueue, const property_list& propList = {}); + bool begin_recording(const std::vector& recordingQueues, const property_list& propList = {}); + + bool end_recording(); + bool end_recording(queue& recordingQueue); + bool end_recording(const std::vector& recordingQueues); + + node add(const property_list& propList = {}); + + template + node add(T cgf, const property_list& propList = {}); + + void make_edge(node& src, node& dest); +}; + +template<> +class command_graph { +public: + command_graph() = delete; + void update(const command_graph& graph); +}; +} // namespace ext::oneapi::experimental + +// New methods added to the sycl::queue class +using namespace ext::oneapi::experimental; +class queue { +public: + + ext::oneapi::experimental::queue_state + ext_oneapi_get_state() const; + + /* -- graph convenience shortcuts -- */ + + event ext_oneapi_graph(command_graph& graph); + event ext_oneapi_graph(command_graph& graph, + event depEvent); + event ext_oneapi_graph(command_graph& graph, + const std::vector& depEvents); +}; + +// New methods added to the sycl::handler class +class handler { +public: + void ext_oneapi_graph(command_graph& graph); +} + +} // namespace sycl +---- + +=== Device Info Query + +Due to the experimental nature of the extension, support is not available across +all devices. The following device support query is added to report devices which +are currently supported, and how that support is implemented. + + +Table {counter: tableNumber}. Device Info Queries. +[%header] +|=== +| Device Descriptors | Return Type | Description + +|`info::device::graph_support` +|`info::device::graph_support_level` +|When passed to `device::get_info<...>()`, the function returns `native` +if there is an underlying SYCL backend command-buffer construct which is used +to propagate the graph to the backend. If no backend construct exists, or +building on top of it has not yet been implemented, then `emulated` is +returned. Otherwise `unsupported` is returned if the SYCL device doesn't +support using this graph extension. + +|=== + + +=== Node + +:crs: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:reference-semantics + +Node is a class that encapsulates tasks like SYCL kernel functions, memory +operations, or host tasks for deferred execution. A graph must +be created first, the structure of a graph is defined second by adding nodes and +edges. + +The `node` class provides the {crs}[common reference semantics]. + +[source,c++] +---- +namespace sycl::ext::oneapi::experimental { + class node {}; +} +---- + +==== Depends-On Property + +The API for explicitly adding nodes to a `command_graph` includes a +`property_list` parameter. This extension defines the `depends_on` property to +be passed here. `depends_on` defines any `node` objects for the created node to +be dependent on, and therefore form an edge with. These nodes are in addition to +the dependent nodes identified from the command-group requisites of the created +node. + +[source,c++] +---- +namespace sycl::ext::oneapi::experimental::property::node +class depends_on { + public: + template + depends_on(NodeTN... nodes); +}; +} +---- + +=== Graph + +This extension adds a new `command_graph` object which follows the +{crs}[common reference semantics] of other SYCL runtime objects. + +A `command_graph` represents a directed acyclic graph of nodes, where each node +represents a single command for a specific device or a sub-graph. The execution +of a graph completes when all its nodes have completed. + +A `command_graph` is built up by either recording queue submissions or +explicitly adding nodes, then once the user is happy that the graph is complete, +the graph instance is finalized into an executable variant which can have no +more nodes added to it. Finalization may be a computationally expensive +operation as the runtime can perform optimizations based on the graph +structure. After finalization the graph can be submitted for execution on a +queue one or more times with reduced overhead. + +==== Graph State + +An instance of a `command_graph` object can be in one of two states: + +* **Modifiable** - Graph is under construction and new nodes may be added to it. +* **Executable** - Graph topology is fixed after finalization and graph is ready to + be submitted for execution. + +A `command_graph` object is constructed in the _modifiable_ state and is made +_executable_ by the user invoking `command_graph::finalize()` to create a +new executable instance of the graph. An executable graph cannot be converted +to a modifiable graph. After finalizing a graph in the modifiable state, it is +valid for a user to add additional nodes and finalize again to create subsequent +executable graphs. The state of a `command_graph` object is made explicit by +templating on state to make the class strongly typed, with the default template +argument being `graph_state::modifiable` to reduce code verbosity on +construction. + +.Graph State Diagram +[source, mermaid] +.... +graph LR + Modifiable -->|Finalize| Executable +.... + +==== Graph Properties [[graph-properties]] + +===== No-Host-Copy Property + +The `no_host_copy` property is defined by this extension and can be passed to +either the `command_graph` constructor or the `command_graph::begin_recording()` +member function. This property will disable the host data copy that may +occur as detailed in the <> section of +this specification. + +Passing this property represents a promise from the user that host data +associated with a buffer that was created using a host data pointer will +outlive any executable graphs created from a modifiable graph which uses +that buffer. + +===== No-Cycle-Check Property + +The `property::graph::no_cycle_check` property disables any checks if a newly +added dependency will lead to a cycle in a specific `command_graph` and can be +passed to a `command_graph` on construction via the property list parameter. +As a result, no errors are reported when a function tries to create a cyclic +dependency. Thus, it's the user's responsibility to create an acyclic graph +for execution when this property is set. Creating a cycle in a `command_graph` +puts that `command_graph` into an undefined state. Any further operations +performed on a `command_graph` in this state will result in undefined +behavior. + +==== Executable Graph Update + +A graph in the executable state can have each nodes inputs & outputs updated +using the `command_graph::update()` method. This takes a graph in the +modifiable state and updates the executable graph to use the node input & +outputs of the modifiable graph, a technique called _Whole Graph Update_. The +modifiable graph must have the same topology as the graph originally used to +create the executable graphs, with the nodes targeting the same devices and +added in the same order. + +==== Graph Member Functions + +Table {counter: tableNumber}. Constructor of the `command_graph` class. +[cols="2a,a"] +|=== +|Constructor|Description + +| +[source,c++] +---- +command_graph(const context& syclContext, + const device& syclDevice, + const property_list& propList = {}); +---- +|Creates a SYCL `command_graph` object in the modifiable state for context +`syclContext` and device `syclDevice`. Zero or more properties can be provided +to the constructed SYCL `command_graph` via an instance of `property_list`. + +Preconditions: + +* This constructor is only available when the `command_graph` state is + `graph_state::modifiable`. + +Parameters: + +* `syclContext` - Context which will be associated with this graph and all + nodes within it. This is an immutable characteristic of the graph. + +* `syclDevice` - Device that all nodes added to the graph will target, + an immutable characteristic of the graph. Must be associated with + `syclContext`. + +* `propList` - Optional parameter for passing properties. Valid `command_graph` + constructor properties are listed in Section <>. + +Exceptions: + +* Throws synchronously with error code `invalid` if `syclDevice` is not +associated with `syclContext`. + +* Throws synchronously with error code `invalid` if `syclDevice` + <>. + +|=== + +Table {counter: tableNumber}. Member functions of the `command_graph` class. +[cols="2a,a"] +|=== +|Member function|Description + +| +[source,c++] +---- +node add(const property_list& propList = {}); +---- +|This creates an empty node which contains no command. Its intended use is +to make a connection point inside a graph between groups of nodes, and can +significantly reduce the number of edges ( O(n) vs. O(n^2^) ). + +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::modifiable`. + +Parameters: + +* `propList` - Zero or more properties can be provided to the constructed node + via an instance of `property_list`. The `property::node::depends_on` property + can be passed here with a list of nodes to create dependency edges on. + + +Returns: The empty node which has been added to the graph. + +Exceptions: + +* Throws synchronously with error code `invalid` if a queue is recording + commands to the graph. + +| +[source,c++] +---- +template +node add(T cgf, const property_list& propList = {}); +---- +|The `cgf` command group function behaves in much the same way as the command +group function passed to `queue::submit` unless explicitly stated otherwise in +<>. Code in the +function is executed synchronously, before the function returns back to +`command_graph::add`, with the exception of any SYCL commands (e.g. kernels, +host tasks, or explicit memory copy operations). These commands are captured +into the graph and executed asynchronously when the graph is submitted to a +queue. The requisites of `cgf` will be used to identify any dependent nodes in +the graph to form edges with. + +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::modifiable`. + +Parameters: + +* `cgf` - Command group function object to be added as a node. + +* `propList` - Zero or more properties can be provided to the constructed node + via an instance of `property_list`. The `property::node::depends_on` property + can be passed here with a list of nodes to create dependency edges on. + +Returns: The command-group function object node which has been added to the graph. + +Exceptions: + +* Throws synchronously with error code `invalid` if a queue is recording + commands to the graph. + +| +[source,c++] +---- +void make_edge(node& src, node& dest); +---- + +|Creates a dependency between two nodes representing a happens-before relationship. + +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::modifiable`. + +Parameters: + +* `src` - Node which will be a dependency of `dest`. + +* `dest` - Node which will be dependent on `src`. + +Exceptions: + +* Throws synchronously with error code `invalid` if a queue is recording + commands to the graph object. + +* Throws synchronously with error code `invalid` if `src` or `dest` + are not valid nodes assigned to the graph object. + +* Throws synchronously with error code `invalid` if `src` and `dest` + are the same node. + +* Throws synchronously with error code `invalid` if the resulting dependency would + lead to a cycle. This error is omitted when `property::graph::no_cycle_check` is set. + +| +[source,c++] +---- +command_graph +finalize(const property_list& propList = {}) const; +---- + +|Synchronous operation that creates a new graph in the executable state with a +fixed topology that can be submitted for execution on any queue sharing the +context associated with the graph. It is valid to call this method multiple times +to create subsequent executable graphs. It is also valid to continue to add new +nodes to the modifiable graph instance after calling this function. It is valid +to finalize an empty graph instance with no recorded commands. + +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::modifiable`. + +Parameters: + +* `propList` - Optional parameter for passing properties. No finalization + properties are defined by this extension. + +Returns: A new executable graph object which can be submitted to a queue. + +|=== + +Table {counter: tableNumber}. Member functions of the `command_graph` class for queue recording. +[cols="2a,a"] +|=== +|Member function|Description + +| +[source, c++] +---- +bool +begin_recording(queue& recordingQueue, + const property_list& propList = {}) +---- + +|Synchronously changes the state of `recordingQueue` to the +`queue_state::recording` state. + +Parameters: + +* `recordingQueue` - A `sycl::queue` object to change to the + `queue_state::recording` state and start recording commands to the graph + instance. + +* `propList` - Optional parameter for passing properties. Properties for + the `command_graph` class are defined in <>. + +Returns: `true` if `recordingQueue` has its state changed from +`queue_state::executing` to `queue_state::recording`, `false` otherwise. + +Exceptions: + +* Throws synchronously with error code `invalid` if `recordingQueue` is + already recording to a different graph. + +* Throws synchronously with error code `invalid` if `recordingQueue` is + associated with a device or context that is different from the device + and context used on creation of the graph. +| +[source, c++] +---- +bool +begin_recording(const std::vector& recordingQueues, + const property_list& propList = {}) +---- + +|Synchronously changes the state of each queue in `recordingQueues` to the +`queue_state::recording` state. + +Parameters: + +* `recordingQueues` - List of `sycl::queue` objects to change to the + `queue_state::recording` state and start recording commands to the graph + instance. + +* `propList` - Optional parameter for passing properties. Properties for + the `command_graph` class are defined in <>. + +Returns: `true` if any queue in `recordingQueues` has its state changed from +`queue_state::executing` to `queue_state::recording`, `false` otherwise. + +Exceptions: + +* Throws synchronously with error code `invalid` if the any queue in + `recordingQueues` is already recording to a different graph. + +* Throws synchronously with error code `invalid` if any of `recordingQueues` + is associated with a device or context that is different from the device + and context used on creation of the graph. + +| +[source, c++] +---- +bool end_recording() +---- + +|Synchronously finishes recording on all queues that are recording to the +graph and sets their state to `queue_state::executing`. + +Returns: `true` if any queue recording to the graph has its state changed from +`queue_state::recording` to `queue_state::executing`, `false` otherwise. + +| +[source, c++] +---- +bool end_recording(queue& recordingQueue) +---- + +|Synchronously changes the state of `recordingQueue` to the +`queue_state::executing` state. + +Parameters: + +* `recordingQueue` - A `sycl::queue` object to change to the executing state. + +Returns: `true` if `recordingQueue` has its state changed from +`queue_state::recording` to `queue_state::executing`, `false` otherwise. + +Exceptions: + +* Throws synchronously with error code `invalid` if `recordingQueue` is + recording to a different graph. + +| +[source, c++] +---- +bool end_recording(const std::vector& recordingQueues) +---- + +|Synchronously changes the state of each queue in `recordingQueues` to the +`queue_state::executing` state. + +Parameters: + +* `recordingQueues` - List of `sycl::queue` objects to change to the executing + state. + +Returns: `true` if any queue in `recordingQueues` has its state changed from +`queue_state::recording` to `queue_state::executing`, `false` otherwise. + +Exceptions: + +* Throws synchronously with error code `invalid` if any queue in + `recordingQueues` is recording to a different graph. + +|=== + +:sycl-kernel-function: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sycl-kernel-function + +Table {counter: tableNumber}. Member functions of the `command_graph` class (executable graph update). +[cols="2a,a"] +|=== +|Member function|Description + +| +[source, c++] +---- +void +update(const command_graph& graph); +---- + + +|Updates the executable graph node inputs & outputs from a topologically +identical modifiable graph. A topologically identical graph is one with the +same structure of nodes and edges, and the nodes added in the same order to +both graphs. Equivalent nodes in topologically identical graphs each have the +same command, targeting the same device. There is the additional limitation that +to update an executable graph, every node in the graph must be either a kernel +command or a host task. + +The only characteristic that can differ between two topologically identical +graphs during an update are the arguments to kernel nodes. For example, +the graph may capture different values for the USM pointers or accessors used +in the graph. It is these kernels arguments in `graph` that constitute the +inputs & outputs to update to. + +Differences in the following characteristics between two graphs during an +update results in undefined behavior: + +* Modifying the native C++ callable of a `host task` node. +* Modifying the {sycl-kernel-function}[kernel function] of a kernel node. + +The effects of the update will be visible on the next submission of the +executable graph without the need for additional user synchronization. + +Preconditions: + +* This member function is only available when the `command_graph` state is + `graph_state::executable`. + +Parameters: + +* `graph` - Modifiable graph object to update graph node inputs & outputs with. + This graph must have the same topology as the original graph used on + executable graph creation. + +Exceptions: + +* Throws synchronously with error code `invalid` if the topology of `graph` is + not the same as the existing graph topology, or if the nodes were not added in + the same order. + +:handler-copy-functions: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.members.handler.copy + +* Throws synchronously with error code `invalid` if `graph` contains any node + which is not a kernel command or host task, e.g. + {handler-copy-functions}[memory operations]. + +* Throws synchronously with error code `invalid` if the context or device + associated with `graph` does not match that of the `command_graph` being + updated. +|=== + +=== Queue Class Modifications + +:queue-class: https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:interface.queue.class + +This extension modifies the {queue-class}[SYCL queue class] such that +<> is introduced to queue objects, allowing an instance to be +put into a mode where command-groups are recorded to a graph rather than +submitted immediately for execution. + +<> are also added to the +`sycl::queue` class in this extension as queue shortcuts for `handler::graph()`. + +==== Queue State + +The `sycl::queue` object can be in either of two states. The default +`queue_state::executing` state is where the queue has its normal semantics of +submitted command-groups being immediately scheduled for asynchronous execution. + +The alternative `queue_state::recording` state is used for graph construction. +Instead of being scheduled for execution, command-groups submitted to the queue +are recorded to a graph object as new nodes for each submission. After recording +has finished and the queue returns to the executing state, the recorded commands are +not then executed, they are transparent to any following queue operations. The state +of a queue can be queried with `queue::ext_oneapi_get_state()`. + +.Queue State Diagram +[source, mermaid] +.... +graph LR + Executing -->|Begin Recording| Recording + Recording -->|End Recording| Executing +.... + +==== Queue Properties + +:queue-properties: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:queue-properties + +There are {queue-properties}[two properties] defined by the core SYCL +specification that can be passed to a `sycl::queue` on construction via the +property list parameter. They interact with this extension in the following +ways: + +1. `property::queue::in_order` - When a queue is created with the in-order + property, recording its operations results in a straight-line graph, as each + operation has an implicit dependency on the previous operation. However, + a graph submitted to an in-order queue will keep its existing structure such + that the complete graph executes in-order with respect to the other + command-groups submitted to the queue. + +2. `property::queue::enable_profiling` - This property has no effect on graph + recording. When set on the queue a graph is submitted to however, it allows + profiling information to be obtained from the event returned by a graph + submission. As it is not defined how a submitted graph will be split up for + scheduling at runtime, the `uint64_t` timestamp reported from a profiling + query on a graph execution event has the following semantics, which may be + pessimistic about execution time on device. + + * `info::event_profiling::command_submit` - Timestamp when the graph is + submitted to the queue. + * `info::event_profiling::command_start` - Timestamp when the first + command-group node begins running. + * `info::event_profiling::command_end` - Timestamp when the last + command-group node completes execution. + +==== New Queue Member Functions + +Table {counter: tableNumber}. Additional member functions of the `sycl::queue` class. +[cols="2a,a"] +|=== +|Member function|Description + +| +[source,c++] +---- +queue_state +queue::ext_oneapi_get_state() const; +---- + +| Query the <> of the queue. + +Returns: If the queue is in the default state where commands are scheduled +immediately for execution, `queue_state::executing` is returned. Otherwise, +`queue_state::recording` is returned where commands are redirected to a `command_graph` +object. +| +[source,c++] +---- +event +queue::ext_oneapi_graph(command_graph& graph) +---- + +|Queue shortcut function that is equivalent to submitting a command-group +containing `handler::ext_oneapi_graph(graph)`. + +The command status of the event returned will be +`info::event_command_status::running` once any command group node starts +executing on a device, and status `info::event_command_status::complete` once +all the nodes have finished execution. +| +[source,c++] +---- +event +queue::ext_oneapi_graph(command_graph& graph, + event depEvent); +---- + +|Queue shortcut function that is equivalent to submitting a command-group +containing `handler::depends_on(depEvent)` and +`handler::ext_oneapi_graph(graph)`. + +The command status of the event returned will be +`info::event_command_status::running` once any command group node starts +executing on a device, and status `info::event_command_status::complete` once +all the nodes have finished execution. +| +[source,c++] +---- +event +queue::ext_oneapi_graph(command_graph& graph, + const std::vector& depEvents); +---- + +|Queue shortcut function that is equivalent to submitting a command-group +containing `handler::depends_on(depEvents)` and +`handler::ext_oneapi_graph(graph)`. + +The command status of the event returned will be +`info::event_command_status::running` once any command group node starts +executing on a device, and status `info::event_command_status::complete` once +all the nodes have finished execution. +|=== + +==== New Handler Member Functions + +Table {counter: tableNumber}. Additional member functions of the `sycl::handler` class. +[cols="2a,a"] +|=== +|Member function|Description + +| +[source,c++] +---- +void +handler::ext_oneapi_graph(command_graph& graph) +---- + +|Invokes the execution of a graph. Only one instance of `graph` may be executing, +or pending execution, at any time. Concurrent graph execution can be achieved by +finalizing a graph in modifiable state into multiple graphs in executable state. + +Parameters: + +* `graph` - Graph object to execute. + +Exceptions: + +* Throws synchronously with error code `invalid` if the handler is submitted + to a queue which doesn't have a SYCL context which matches the context of + the executable graph. + +* Throws synchronously with error code `invalid` if a previous submission of + `graph` has yet to complete execution. +|=== + +=== Thread Safety + +The new functions in this extension are thread-safe, the same as member +functions of classes in the base SYCL specification. If user code does +not perform synchronization between two threads accessing the same queue, +there is no strong ordering between events on that queue, and the kernel +submissions, recording and finalization will happen in an undefined order. + +When one thread ends recording on a queue while another +thread is submitting work, which kernels will be part of the subsequent +graph is undefined. If user code enforces a total order on the queue +events, then the behavior is well-defined, and will match the observable +total order. + +The returned value from the `queue::ext_oneapi_get_state()` should be +considered immediately stale in multi-threaded usage, as another thread could +have preemptively changed the state of the queue. + +=== Exception Safety + +In addition to the destruction semantics provided by the SYCL +{crs}[common reference semantics], when the last copy of a modifiable +`command_graph` is destroyed recording is ended on any queues that are recording +to that graph, equivalent to `+this->end_recording()+`. + +As a result, users don't need to manually wrap queue recording code in a +`try` / `catch` block to reset the state of recording queues on an exception +back to the executing state. Instead, an uncaught exception destroying the +modifiable graph will perform this action, useful in RAII pattern usage. + +=== Storage Lifetimes [[storage-lifetimes]] + +The lifetime of any buffer recorded as part of a submission +to a command graph will be extended in keeping with the common reference +semantics and buffer synchronization rules in the SYCL specification. It will be +extended either for the lifetime of the graph (including both modifiable graphs +and the executable graphs created from them) or until the buffer is no longer +required by the graph (such as after being replaced through executable graph update). + +If a buffer created with a host data pointer is recorded as part of a submission to +a command graph, the lifetime of that host data will also be extended by taking a +copy of that data inside the buffer. To illustrate, consider the following example: + +[source,c++] +---- +void foo(queue q /* queue in recording mode */ ) { + float data[NUM]; + buffer buf{data, range{NUM}}; + q.submit([&](handler &cgh) { + accessor acc{buf, cgh, read_only}; + cgh.single_task([] { + // use "acc" + }); + }); + // "data" goes out of scope +} +---- + +In this example, the implementation extends the lifetime of the buffer because +it is used in the recorded graph. Because the buffer uses the host memory data, +the implementation also makes an internal copy of that host data. As illustrated +above, that host memory might go out of scope before the recorded graph goes out +of scope, or before the data has been copied to the device. + +The default behavior is to always copy the host data in a case like this, but +this is not necessary if the user knows that the lifetime of the host data +outlives the lifetime of the recorded graph. If the user knows this is the +case, they may use the `graph::no_host_copy` property to avoid the internal +copy. Passing the property to `begin_recording()` will prevent host copies only +for commands recorded before `end_recording()` is called for a given queue. +Passing the property to the `command_graph` constructor will prevent host copies +for all commands recorded to the graph. + +The implementation guarantees that the host memory will not be copied internally +if all the commands accessing this buffer use `access_mode::write` or the +`no_init` property because the host memory is not needed in these cases. +Note, however, that these cases require the application to disable copy-back +as described in <>. + +=== Host Tasks + +:host-task: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#subsec:interfaces.hosttasks + +A {host-task}[host task] is a native C++ callable, scheduled according to SYCL +dependency rules. It is valid to record a host task as part of graph, though it +may lead to sub-optimal graph performance because a host task node may prevent +the SYCL runtime from submitting the entire executable `command_graph` to the +device at once. + +Host tasks can be updated as part of <> +by replacing the whole node with the new callable. + +[source,c++] +---- +auto node = graph.add([&](sycl::handler& cgh){ + // Host code here is evaluated during the call to add() + cgh.host_task([=](){ + // Code here is evaluated as part of executing the command graph node + }); +}); +---- + +=== Queue Behavior In Recording Mode + +When a queue is placed in recording mode via a call to `command_graph::begin_recording`, +some features of the queue are no longer available because the commands are not +executed during this mode. The general philosophy is to throw an exception at +runtime when a feature is not available, so that there is an obvious indication +of failure. The following list describes the behavior that changes during +recording mode. Features not listed below behave the same in recording mode as +they do in non-recording mode. + +==== Event Limitations + +Events returned from queue submissions when a queue is in the recording state +may only be used as parameters to `handler::depends_on()` or as dependent +events for queue shortcuts like `queue::parallel_for()` for submissions which +are being recorded to the same modifiable `command_graph`. + +- Calling `event::get_info()` or +`event::get_profiling_info()` on an event returned from a queue submission +recorded to a graph will throw synchronously with error code `invalid`. + +- Waiting on an event returned from a queue submission recorded to a graph +will throw synchronously with error code `invalid`. + +- Waiting on a queue in the recording state is an error and will throw +synchronously with error code `invalid`. + +==== Buffer Limitations + +Because of the delayed execution of a recorded graph, it is not possible to support +captured code which relies on the copy-back on destruction behavior of buffers. +Typically, applications would rely on this behavior to do work on the host which +cannot inherently be captured inside a command graph. + +- Thus, when recording to a graph it is an error to submit a command which has +an accessor on a buffer which would cause a write-back to happen. Using an +incompatible buffer in this case will result in a synchronous error being +thrown with error code `invalid`. + +- The copy-back mechanism can be disabled explicitly for buffers with attached host +storage using either `buffer::set_final_data(nullptr)` or +`buffer::set_write_back(false)`. + +- It is also an error to create a host accessor to a buffer which is used in +commands which are currently being recorded to a command graph. Attempting to +construct a host accessor to an incompatible buffer will result in a +synchronous error being thrown with error code `invalid`. + +==== Error Handling + +When a queue is in recording mode asynchronous exceptions will not be +generated, as no device execution is occurring. Synchronous errors specified as +being thrown in the default queue executing state, will still be thrown when a +queue is in the recording state. Queue query methods operate as usual in +recording mode, as opposed to throwing. + +The `command_graph::begin_recording` and `command_graph::end_recording` +entry-points return a `bool` value informing the user whether a related queue +state change occurred. False is returned rather than throwing an exception when +no queue state is changed. This design is because the queues are already in +the state the user desires, so if the function threw an exception in this case, +the application would likely swallow it and then proceed. + +=== Interaction With Other Extensions [[extension-interaction]] + +This section defines the interaction of `sycl_ext_oneapi_graph` with other +extensions. + +==== sycl_ext_oneapi_discard_queue_events + +When recording a `sycl::queue` which has been created with the +`ext::oneapi::property::queue::discard_event` property, it is invalid to +use these events returned from queue submissions to create graph edges. This is +in-keeping with the +link:../supported/sycl_ext_oneapi_discard_queue_events.asciidoc[sycl_ext_oneapi_discard_queue_events] +specification wording that `handler::depends_on()` throws an exception when +passed an invalid event. + +==== sycl_ext_oneapi_enqueue_barrier + +The new handler methods, and queue shortcuts, defined by +link:../supported/sycl_ext_oneapi_enqueue_barrier.asciidoc[sycl_ext_oneapi_enqueue_barrier] +cannot be used in graph nodes. A synchronous exception will be thrown with +error code `invalid` if a user tries to add them to a graph. + +Removing this restriction is something we may look at for future revisions of +`sycl_ext_oneapi_graph`. + +==== sycl_ext_oneapi_memcpy2d + +The new handler methods, and queue shortcuts, defined by +link:../supported/sycl_ext_oneapi_memcpy2d.asciidoc[sycl_ext_oneapi_memcpy2d] +cannot be used in graph nodes. A synchronous exception will be thrown with +error code `invalid` if a user tries to add them to a graph. + +Removing this restriction is something we may look at for future revisions of +`sycl_ext_oneapi_graph`. + +==== sycl_ext_oneapi_queue_priority + +The queue priority property defined by +link:../supported/sycl_ext_oneapi_queue_priority.asciidoc[sycl_ext_oneapi_queue_priority] +is ignored during queue recording. + +==== sycl_ext_oneapi_queue_empty + +The `queue::ext_oneapi_empty()` query defined by the +link:../supported/sycl_ext_oneapi_queue_empty.asciidoc[sycl_ext_oneapi_queue_empty] +extension behaves as normal during queue recording and is not captured to the graph. +Recorded commands are not counted as submitted for the purposes of this query. + +==== sycl_ext_intel_queue_index + +The compute index queue property defined by +link:../supported/sycl_ext_intel_queue_index.asciidoc[sycl_ext_intel_queue_index] +is ignored during queue recording. + +Using this information is something we may look at for future revisions of +`sycl_ext_oneapi_graph`. + +==== sycl_ext_codeplay_kernel_fusion + +As the +link:../experimental/sycl_ext_codeplay_kernel_fusion.asciidoc[sycl_ext_codeplay_kernel_fusion] +extension also introduces state to a `sycl::queue`, there are restrictions on +its usage when combined with `sycl_ext_oneapi_graph`. Exceptions with error code +`invalid` are thrown in the following cases: + +* `fusion_wrapper::start_fusion()` is called when its associated queue + is in the recording state. +* `command_graph::begin_recording()` is called passing a queue in fusion mode. + +The `sycl::ext::codeplay::experimental::property::queue::enable_fusion` property +defined by the extension is ignored by queue recording. + +To enable kernel fusion in a `command_graph` see the +https://github.com/sommerlukas/llvm/blob/proposal/graph-fusion/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph_fusion.asciidoc[sycl_ext_oneapi_graph_fusion extension proposal] +which is layered ontop of `sycl_ext_oneapi_graph`. + +==== sycl_ext_oneapi_kernel_properties + +The new handler methods, and queue shortcuts, defined by +link:../experimental/sycl_ext_oneapi_kernel_properties.asciidoc[sycl_ext_oneapi_kernel_properties] +cannot be used in graph nodes. A synchronous exception will be thrown with error +code `invalid` if a user tries to add them to a graph. + +Removing this restriction is something we may look at for future revisions of +`sycl_ext_oneapi_graph`. + +==== sycl_ext_oneapi_prod + +The new `sycl::queue::ext_oneapi_prod()` method added by +link:../proposed/sycl_ext_oneapi_prod.asciidoc[sycl_ext_oneapi_prod] +behaves as normal during queue recording and is not captured to the graph. +Recorded commands are not counted as submitted for the purposes of its operation. + +==== sycl_ext_oneapi_device_global + +The new handler methods, and queue shortcuts, defined by +link:../proposed/sycl_ext_oneapi_device_global.asciidoc[sycl_ext_oneapi_device_global]. +cannot be used in graph nodes. A synchronous exception will be thrown with error +code `invalid` if a user tries to add them to a graph. + +Removing this restriction is something we may look at for future revisions of +`sycl_ext_oneapi_graph`. + +== Examples + +[NOTE] +==== +The examples below demonstrate intended usage of the extension, but may not be +compatible with the proof-of-concept implementation, as the proof-of-concept +implementation is currently under development. +==== + +Examples for demonstrative purposes only, and may leave out details such as how +input data is set. + +=== Dot Product + +[source,c++] +---- +... + +#include + +int main() { + namespace sycl_ext = sycl::ext::oneapi::experimental; + + const size_t n = 10; + float alpha = 1.0f; + float beta = 2.0f; + float gamma = 3.0f; + + sycl::queue q; + sycl_ext::command_graph g(q.get_context(), q.get_device()); + + float *dotp = sycl::malloc_shared(1, q); + float *x = sycl::malloc_device(n, q); + float *y = sycl::malloc_device(n, q); + float *z = sycl::malloc_device(n, q); + + // Add commands to the graph to create the following topology. + // + // i + // / \ + // a b + // \ / + // c + + /* init data on the device */ + auto node_i = g.add([&](sycl::handler& h) { + h.parallel_for(n, [=](sycl::id<1> it){ + const size_t i = it[0]; + x[i] = 1.0f; + y[i] = 2.0f; + z[i] = 3.0f; + }); + }); + + auto node_a = g.add([&](sycl::handler& h) { + h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) { + const size_t i = it[0]; + x[i] = alpha * x[i] + beta * y[i]; + }); + }, { sycl_ext::property::node::depends_on(node_i)}); + + auto node_b = g.add([&](sycl::handler& h) { + h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) { + const size_t i = it[0]; + z[i] = gamma * z[i] + beta * y[i]; + }); + }, { sycl_ext::property::node::depends_on(node_i)}); + + auto node_c = g.add( + [&](sycl::handler& h) { + h.parallel_for(sycl::range<1>{n}, + sycl::reduction(dotp, 0.0f, std::plus()), + [=](sycl::id<1> it, auto &sum) { + const size_t i = it[0]; + sum += x[i] * z[i]; + }); + }, + { sycl_ext::property::node::depends_on(node_a, node_b)}); + + auto exec = g.finalize(); + + // use queue shortcut for graph submission + q.ext_oneapi_graph(exec).wait(); + + // memory can be freed inside or outside the graph + sycl::free(x, q); + sycl::free(y, q); + sycl::free(z, q); + sycl::free(dotp, q); + + return 0; +} + + +... +---- + +=== Diamond Dependency + +The following snippet of code shows how a SYCL `queue` can be put into a +recording state, which allows a `command_graph` object to be populated by the +command-groups submitted to the queue. Once the graph is complete, recording +finishes on the queue to put it back into the default executing state. The +graph is then finalized so that no more nodes can be added. Lastly, the graph is +submitted in its entirety for execution via +`handler::ext_oneapi_graph(command_graph)`. + +[source, c++] +---- + using namespace sycl; + queue q{default_selector{}}; + + // New object representing graph of command-groups + ext::oneapi::experimental::command_graph graph(q.get_context(), q.get_device()); + { + buffer bufferA{dataA.data(), range<1>{elements}}; + buffer bufferB{dataB.data(), range<1>{elements}}; + buffer bufferC{dataC.data(), range<1>{elements}}; + + // `q` will be put in the recording state where commands are recorded to + // `graph` rather than submitted for execution immediately. + graph.begin_recording(q); + + // Record commands to `graph` with the following topology. + // + // increment_kernel + // / \ + // A->/ A->\ + // / \ + // add_kernel subtract_kernel + // \ / + // B->\ C->/ + // \ / + // decrement_kernel + + q.submit([&](handler& cgh) { + auto pData = bufferA.get_access(cgh); + cgh.parallel_for(range<1>(elements), + [=](item<1> id) { pData[id]++; }); + }); + + q.submit([&](handler& cgh) { + auto pData1 = bufferA.get_access(cgh); + auto pData2 = bufferB.get_access(cgh); + cgh.parallel_for(range<1>(elements), + [=](item<1> id) { pData2[id] += pData1[id]; }); + }); + + q.submit([&](handler& cgh) { + auto pData1 = bufferA.get_access(cgh); + auto pData2 = bufferC.get_access(cgh); + cgh.parallel_for( + range<1>(elements), [=](item<1> id) { pData2[id] -= pData1[id]; }); + }); + + q.submit([&](handler& cgh) { + auto pData1 = bufferB.get_access(cgh); + auto pData2 = bufferC.get_access(cgh); + cgh.parallel_for(range<1>(elements), [=](item<1> id) { + pData1[id]--; + pData2[id]--; + }); + }); + + // queue `q` will be returned to the executing state where commands are + // submitted immediately for extension. + graph.end_recording(); + } + + // Finalize the modifiable graph to create an executable graph that can be + // submitted for execution. + auto exec_graph = graph.finalize(); + + // Execute graph + q.submit([&](handler& cgh) { + cgh.ext_oneapi_graph(exec_graph); + }); + +---- + +== Future Direction + +=== Memory Allocation Nodes + +There is no provided interface for users to define a USM allocation/free +operation belonging to the scope of the graph. It would be error prone and +non-performant to allocate or free memory as a node executed during graph +submission. Instead, such a memory allocation API needs to provide a way to +return a pointer which won't be valid until the allocation is made on graph +finalization, as allocating at finalization is the only way to benefit from +the known graph scope for optimal memory allocation, and even optimize to +eliminate some allocations entirely. + +Such a deferred allocation strategy presents challenges however, and as a result +we recommend instead that prior to graph construction users perform core SYCL +USM allocations to be used in the graph submission. Before to coming to this +recommendation we considered the following explicit graph building interfaces +for adding a memory allocation owned by the graph: + +1. Allocation function returning a reference to the raw pointer, i.e. `void*&`, + which will be instantiated on graph finalization with the location of the + allocated USM memory. + +2. Allocation function returning a handle to the allocation. Applications use + the handle in node command-group functions to access memory when allocated. + +3. Allocation function returning a pointer to a virtual allocation, only backed + with an actual allocation when graph is finalized or submitted. + +Design 1) has the drawback of forcing users to keep the user pointer variable +alive so that the reference is valid, which is unintuitive and is likely to +result in bugs. + +Design 2) introduces a handle object which has the advantages of being a less +error prone way to provide the pointer to the deferred allocation. However, it +requires kernel changes and introduces an overhead above the raw pointers that +are the advantage of USM. + +Design 3) needs specific backend support for deferred allocation. + +=== Device Specific Graph + +A modifiable state `command_graph` contains nodes targeting specific devices, +rather than being a device agnostic representation only tied to devices on +finalization. This allows the implementation to process nodes which require +device information when the command group function is evaluated. For example, +a SYCL reduction implementation may desire the work-group/sub-group size, which +is normally gathered by the runtime from the device associated with the queue. + +This design also enables the future capability for a user to compose a graph +with nodes targeting different devices, allowing the benefits of defining an +execution graph ahead of submission to be extended to multi-device platforms. +Without this capability a user currently has to submit individual single-device +graphs and use events for dependencies, which is a usage model this extension is +aiming to optimize. Automatic load balancing of commands across devices is not a +problem this extension currently aims to solve, it is the responsibility of the +user to decide the device each command will be processed for, not the SYCL +runtime. + +== Issues + +=== Simultaneous Graph Submission + +Enable an instance of a graph in executable state to be submitted for execution +when a previous submission of the same graph has yet to complete execution. + +**UNRESOLVED:** Trending "yes". Backend support for this is inconsistent, but +the runtime could schedule the submissions sequentially for backends which don't +support it. + +=== Multi Device Graph + +Allow an executable graph to contain nodes targeting different devices. + +**UNRESOLVED:** Trending "yes". This feature is something that we are considering +introducing into the extension in later revisions. It has been planned for to the +extent that the definition of a graph node is device specific. + +=== Memory Allocation API + +We would like to provide an API that allows graph scope memory to be +allocated and used in nodes, such that optimizations can be done on +the allocation. No mechanism is currently provided, but see the +section on <> for +some designs being considered. + +**UNRESOLVED:** Trending "yes". Design is under consideration. + +=== Device Agnostic Graph + +Explicit API could support device-agnostic graphs that can be submitted +through queues to a particular device. This issue is related to multi-device +graphs. + +**UNRESOLVED:** Trending "no". Because of current runtime limitations this +can't be implemented with a reasonable effort. + +=== Execution Property + +Current proposal contains extensive extensions to existing API in SYCL. +Can we achieve something similar with user control over the flush behavior +of a queue and providing a handler that can be replayed? + +**UNRESOLVED:** Trending "no". Needs reconsideration of the design and +possible restrictions. + +=== User Guided Scheduling + +For specific workloads it could be beneficial to provide hints to the +runtime how to schedule a command graph onto a device. This info could effect +the scheduling policy like breadth or depth-first, or a combination with a +block size. + +**UNRESOLVED:** Trending "yes". A new property could be added to +the finalize call either extending the basic command graph proposal +or layered as a separate extension proposal. + +== Revision History + +[cols="5,15,15,70"] +[grid="rows"] +[options="header"] +|======================================== +|Rev|Date|Author|Changes + +|1|2023-03-23|Pablo Reble, Ewan Crawford, Ben Tracy, Julian Miller +|Initial public working draft + +|========================================