Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
52 changes: 1 addition & 51 deletions sycl/plugins/level_zero/pi_level_zero.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1307,44 +1307,6 @@ pi_result _pi_context::getAvailableCommandList(
pi_queue Queue, pi_command_list_ptr_t &CommandList, bool UseCopyEngine,
bool AllowBatching, ze_command_queue_handle_t *ForcedCmdQueue) {

// This is a hack. TODO: Proper CommandList allocation per Executable Graph.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this emulation mode to be most useful, in my mind it wouldn't live in a separate branch that we would need to keep up-to-date with the PoC branch, but live in the PoC branch itself.

Rather than removing this code, have you considered making emulation mode an environment variable users can set to ON?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Ewan. The idea of using a separate branch and removing the PI implementation was to ease the review and merging of the code into the mainline. My understanding of the proposed three-step merge process is to provide the Graph API in the emulation mode and extend it with plugin implementations. Thus, SYCL Graph codes could be compiled for all runtimes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, that makes sense as the first part of the 3 stage approach. I'm a bit wary of us trying to maintain a lot of branches, so I think we should sync up on our call later about how soon we want to open that stage 1 PR. If it's ASAP then using sycl-graph-poc-emulation to do that I think is good. However, if it's more delayed and more changes start getting merged into the PoC branch, then maintaining sycl-graph-poc-emulation will be an overhead, and it might just be better to have the path in the PoC branch guarded by a preprocessor macro. Then, once we want to open a DPC++ PR with just stage 1, we can easily remove the macro and just keep the emulation path.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Ewan. The idea of using a separate branch and removing the PI implementation was to ease the review and merging of the code into the mainline. My understanding of the proposed three-step merge process is to provide the Graph API in the emulation mode and extend it with plugin implementations. Thus, SYCL Graph codes could be compiled for all runtimes.

I think the 3-step merge process referred to in the comment on the POC PR is for a future final implementation and not for the POC (which this code is based on). Unless we are planning to close that open PR and resubmit this as the first step, I agree with the idea of having this be optional functionality in the POC guarded by a macro.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will close this in favor of a version that is guarded by the SYCL Graph macro as discussed. I will further split the testing part of this PR into a separate PR.

if( Queue->Properties & PI_EXT_ONEAPI_QUEUE_LAZY_EXECUTION ) {
// TODO: Create new Command List.
if(Queue->LazyCommandListMap.empty()) {
const bool UseCopyEngine = false;
// Adding createCommandList() to LazyCommandListMap
ze_fence_handle_t ZeFence;
ZeStruct<ze_fence_desc_t> ZeFenceDesc;
ze_command_list_handle_t ZeCommandList;

uint32_t QueueGroupOrdinal;
auto &QGroup = Queue->getQueueGroup(UseCopyEngine);
auto &ZeCommandQueue =
//ForcedCmdQueue ? *ForcedCmdQueue :
QGroup.getZeQueue(&QueueGroupOrdinal);
//if (ForcedCmdQueue)
// QueueGroupOrdinal = QGroup.getCmdQueueOrdinal(ZeCommandQueue);

ZeStruct<ze_command_list_desc_t> ZeCommandListDesc;
ZeCommandListDesc.commandQueueGroupOrdinal = QueueGroupOrdinal;

ZE_CALL(zeCommandListCreate, (Queue->Context->ZeContext, Queue->Device->ZeDevice,
&ZeCommandListDesc, &ZeCommandList));

ZE_CALL(zeFenceCreate, (ZeCommandQueue, &ZeFenceDesc, &ZeFence));
std::tie(CommandList, std::ignore) = Queue->LazyCommandListMap.insert(
std::pair<ze_command_list_handle_t, pi_command_list_info_t>(
ZeCommandList, {ZeFence, false, ZeCommandQueue, QueueGroupOrdinal}));

Queue->insertActiveBarriers(CommandList, UseCopyEngine);
//
CommandList->second.ZeFenceInUse = true;
} else {
CommandList = Queue->LazyCommandListMap.begin();
}
return PI_SUCCESS;
}

// Immediate commandlists have been pre-allocated and are always available.
if (Queue->Device->useImmediateCommandLists()) {
CommandList = Queue->getQueueGroup(UseCopyEngine).getImmCmdList();
Expand Down Expand Up @@ -1583,11 +1545,6 @@ void _pi_queue::CaptureIndirectAccesses() {
pi_result _pi_queue::executeCommandList(pi_command_list_ptr_t CommandList,
bool IsBlocking,
bool OKToBatchCommand) {
// When executing a Graph, defer execution if this is a command
// which could be batched (i.e. likely a kernel submission)
if (this->Properties & PI_EXT_ONEAPI_QUEUE_LAZY_EXECUTION && OKToBatchCommand)
return PI_SUCCESS;

bool UseCopyEngine = CommandList->second.isCopy(this);

// If the current LastCommandEvent is the nullptr, then it means
Expand Down Expand Up @@ -3828,14 +3785,7 @@ pi_result piQueueFinish(pi_queue Queue) {
// Flushing cross-queue dependencies is covered by createAndRetainPiZeEventList,
// so this can be left as a no-op.
pi_result piQueueFlush(pi_queue Queue) {
if( Queue->Properties & PI_EXT_ONEAPI_QUEUE_LAZY_EXECUTION ) {

pi_command_list_ptr_t CommandList{};
// TODO:
CommandList = Queue->LazyCommandListMap.begin();

Queue->executeCommandList(CommandList, false, false);
}
(void)Queue;
return PI_SUCCESS;
}

Expand Down
2 changes: 0 additions & 2 deletions sycl/plugins/level_zero/pi_level_zero.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -937,8 +937,6 @@ struct _pi_queue : _pi_object {

// Map of all command lists used in this queue.
pi_command_list_map_t CommandListMap;
// TODO: Assign Graph related command lists to command_graph object
pi_command_list_map_t LazyCommandListMap;

// Helper data structure to hold all variables related to batching
typedef struct CommandBatch {
Expand Down
5 changes: 1 addition & 4 deletions sycl/source/detail/graph_impl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -33,10 +33,7 @@ void graph_impl::exec(sycl::detail::queue_ptr q) {
}

void graph_impl::exec_and_wait(sycl::detail::queue_ptr q) {
if (MFirst) {
exec(q);
MFirst = false;
}
exec(q);
q->wait();
}

Expand Down
4 changes: 1 addition & 3 deletions sycl/source/detail/graph_impl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -81,8 +81,6 @@ struct node_impl {
struct graph_impl {
std::set<node_ptr> MRoots;
std::list<node_ptr> MSchedule;
// TODO: Change one time initialization to per executable object
bool MFirst;

graph_ptr MParent;

Expand All @@ -95,7 +93,7 @@ struct graph_impl {
template <typename T>
node_ptr add(graph_ptr impl, T cgf, const std::vector<node_ptr> &dep = {});

graph_impl() : MFirst(true) {}
graph_impl() {}
};

} // namespace detail
Expand Down
6 changes: 0 additions & 6 deletions sycl/source/detail/queue_impl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -278,12 +278,6 @@ void queue_impl::wait(const detail::code_location &CodeLoc) {
TelemetryEvent = instrumentationProlog(CodeLoc, Name, StreamID, IId);
#endif

if (has_property<ext::oneapi::property::queue::lazy_execution>()) {
const detail::plugin &Plugin = getPlugin();
if (Plugin.getBackend() == backend::ext_oneapi_level_zero)
Plugin.call<detail::PiApiKind::piQueueFlush>(getHandleRef());
}

std::vector<std::weak_ptr<event_impl>> WeakEvents;
std::vector<event> SharedEvents;
{
Expand Down
9 changes: 4 additions & 5 deletions sycl/test/graph/graph-explicit-dotp.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -83,16 +83,15 @@ int main() {
// Using shortcut for executing a graph of commands
q.exec_graph(executable_graph).wait();

if (*dotp != host_gold_result()) {
std::cout << "Error unexpected result!\n";
}
if (*dotp == host_gold_result())
std::cout << "Dot product explicit graph test passed." << std::endl;
else
std::cout << "Dot product explicit graph test failed." << std::endl;

sycl::free(dotp, q);
sycl::free(x, q);
sycl::free(y, q);
sycl::free(z, q);

std::cout << "done.\n";

return 0;
}
13 changes: 11 additions & 2 deletions sycl/test/graph/graph-explicit-queue-shortcuts.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -31,9 +31,18 @@ int main() {
auto e3 = q.exec_graph(executable_graph, e1);
q.exec_graph(executable_graph, {e2, e3}).wait();

sycl::free(arr, q);
bool check = true;
for (int i = 0; i < n; i++) {
if (arr[i] != 1)
check = false;
}

if (check)
std::cout << "Queue shortcuts explicit graph test passed." << std::endl;
else
std::cout << "Queue shortcuts explicit graph test failed." << std::endl;

std::cout << "done " << arr[0] << std::endl;
sycl::free(arr, q);

return 0;
}
7 changes: 5 additions & 2 deletions sycl/test/graph/graph-explicit-reduction.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -28,10 +28,13 @@ int main() {

e.wait();

if (*output == 45)
std::cout << "Reduction explicit graph test passed." << std::endl;
else
std::cout << "Reduction explicit graph test failed." << std::endl;

sycl::free(input, q);
sycl::free(output, q);

std::cout << "done\n";

return 0;
}
67 changes: 67 additions & 0 deletions sycl/test/graph/graph-explicit-repeated-exec.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
// RUN: %clangxx -fsycl -fsycl-targets=%sycl_triple %s -o %t.out
#include <CL/sycl.hpp>
#include <iostream>

#include <sycl/ext/oneapi/experimental/graph.hpp>

int main() {

sycl::property_list properties{
sycl::property::queue::in_order{},
sycl::ext::oneapi::property::queue::lazy_execution{}};

sycl::queue q{sycl::gpu_selector_v, properties};

sycl::ext::oneapi::experimental::command_graph g;

const size_t n = 10;
float *arr = sycl::malloc_shared<float>(n, q);
for (int i = 0; i < n; i++) {
arr[i] = 0;
}

g.add([&](sycl::handler &h) {
h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> idx) {
size_t i = idx;
arr[i] += 1;
});
});

bool check = true;
for (int i = 0; i < n; i++) {
if (arr[i] != 0)
check = false;
}

auto executable_graph = g.finalize(q.get_context());

for (int i = 0; i < n; i++) {
if (arr[i] != 0)
check = false;
}

q.submit([&](sycl::handler &h) { h.exec_graph(executable_graph); });

for (int i = 0; i < n; i++) {
if (arr[i] != 1)
check = false;
}

q.submit([&](sycl::handler &h) { h.exec_graph(executable_graph); });

for (int i = 0; i < n; i++) {
if (arr[i] != 2)
check = false;
}

if (check)
std::cout << "Repeated execution of an explicit graph test passed."
<< std::endl;
else
std::cout << "Repeated execution of an explicit graph test failed."
<< std::endl;

sycl::free(arr, q);

return 0;
}
26 changes: 21 additions & 5 deletions sycl/test/graph/graph-explicit-simple.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,9 @@ int main() {

const size_t n = 10;
float *arr = sycl::malloc_shared<float>(n, q);
for (int i = 0; i < n; i++) {
arr[i] = 0;
}

g.add([&](sycl::handler &h) {
h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> idx) {
Expand All @@ -24,19 +27,32 @@ int main() {
});
});

auto result_before_exec1 = arr[0];
bool check = true;
for (int i = 0; i < n; i++) {
if (arr[i] != 0)
check = false;
}

auto executable_graph = g.finalize(q.get_context());

auto result_before_exec2 = arr[0];
for (int i = 0; i < n; i++) {
if (arr[i] != 0)
check = false;
}

q.submit([&](sycl::handler &h) { h.exec_graph(executable_graph); });

auto result = arr[0];
for (int i = 0; i < n; i++) {
if (arr[i] != 1)
check = false;
}

sycl::free(arr, q);
if (check)
std::cout << "Simple explicit graph test passed." << std::endl;
else
std::cout << "Simple explicit graph test failed." << std::endl;

std::cout << "done.\n";
sycl::free(arr, q);

return 0;
}