[SYCL] Enable useful (not random) output from stream #737

againull · 2019-10-16T04:41:58Z

Pool of flush buffers is allocated in local memory. This pool contains
space for each work item in the work group. Each work item writes to
its own space (flush buffer), as a result output from different work
items is not mixed. Data is flushed to global buffer on endl, flush or
when kernel execution is finished. Global buffer contains all output
from the kernel. Offset of the WI's flush buffer in the pool is
calculated only once in __init method. Call to this method is generated
by frontend.

Signed-off-by: Artur Gainullin [email protected]

rolandschulz

Only took a very quick look. Thanks for getting this done so quickly!

rolandschulz · 2019-10-16T07:27:26Z

sycl/include/CL/sycl/stream.hpp

@@ -104,6 +107,15 @@ class stream {

  bool operator!=(const stream &LHS) const;

+  ~stream() {


The stream class is copyable. I don't think we want to flush if a copy of the stream class is deleted. Only when the last one is deleted / the kernel finishes should we flush.

Good question, thx. Now everything is ok if stream is copied on the host side. But it is more tricky if stream is copied in the device code, because now I use a field of the stream object to save offset in flush buffer for work item. So 2 objects (with their own offset in same flush buffer) could write to same local space. It looks like I need to use local mem for offset, for exmaple the first several bytes of each flush buffer can store offset in the flush buffer.

It seems you fixed the more important problem that the output is correct if the stream is copied. Did you address my initial comment? It seems the stream class is still causing a flush by calling flushBuffer on each call of the destructor. In the case the destructor of a copy is called this causes an extra flush which isn't documented.

I believe that I have addressed this problem. I do flush only if Offset is not zero, i.e. if there is some data to flush. Offset is shared between copies of the stream object. So if some of the copies flushed data then other copy will not flush because Offset will be 0.

I agree you won't flush multiple times if all destructors are called at the same time because they all are in the same scope. But if you have:

{ //begin of kernel s << "a"; { auto s2 = s; s2 << "b"; } //destructor of s2 s << "c"; } // end of kernel. destructor of s

The current implementation will cause an extra flush between b and c. The example is contrived. But if you e.g. pass the stream by value to a function you would have the same at the end of the function scope.

This isn't blocking. But needs to be addressed at some point.

Oh, I see, thank you for the example. Then I will address this in separate commit.

mlychkov · 2019-10-16T10:25:51Z

sycl/include/CL/sycl/handler.hpp

+            MNDRDesc.GlobalSize.size() != 1 &&
+            MNDRDesc.GlobalSize.size() != 1) {


Maybe there is typo in one of the lines?

You are right, fixed.

asavonic · 2019-10-17T11:34:04Z

sycl/include/CL/sycl/detail/stream_impl.hpp

      // Overflow
      return false;
    New = Cur + Size;
-  } while (!OffsetAcc[0].compare_exchange_strong(Cur, New));
+  } while (!GlobalOffset[0].compare_exchange_strong(Cur, New));


compare_exchange_weak should be enough.
Edit: people told me that there is no weak CAS. Please disregard this comment.

asavonic · 2019-10-17T11:39:50Z

sycl/include/CL/sycl/handler.hpp

+            MNDRDesc.GlobalSize.size() != 1 &&
+            MNDRDesc.GlobalSize.size() != 1) {
+          auto LocalAccImpl = detail::getSyclObjImpl(*LAcc);
+          // If local size is not speicified then consider that there is one


So if user submits a kernel with 1M work-items, and let the RT decide the number of WI in a WG, we're going to allocate local memory as if there were 1M work-items in a workgroup?
That seems not very scalable.

Changed this part. As you said, if user doesn't provide WG size then RT chooses number of WIs in WG. SO we will know this number only during kernel execution in __init. So if WG size is not provided by user, I use global size or info about max WG size for device to allocate local memory (take the minimum). In the worst case decided to allocate 80% of local mem.

asavonic · 2019-10-17T11:44:58Z

sycl/include/CL/sycl/stream.hpp

+  // Offset in the flush buffer
+  mutable unsigned Offset = 0;
+
+  mutable size_t FlushBufferSize = 200;


Just used this constant during development for testing purposes, forgot to remove before uploading. Thanks, fixed.

asavonic · 2019-10-17T11:48:17Z

sycl/test/basic_tests/stream.cpp

@@ -208,7 +208,7 @@ int main() {
    Queue.submit([&](handler &CGH) {
      stream Out(1024, 80, CGH);
      CGH.parallel_for<class stream_string>(
-          range<1>(10), [=](id<1> i) { Out << "Hello, World!\n"; });
+          range<1>(10), [=](id<1> i) { Out << "Hello, World!" << endl; });


What is the difference here?

std::endl usually flushes after adding a newline, this commit looks like it adds a fall-through to a flush for the SYCL implementation of endl to make it equivalent. So it's making sure it flushes and adds a newline now if I understand the change to the stream << operator correctly!

You are right, thx!

Thanks. So what this "stream endl" implementation actually does to flush? Does it write data to the global buffer and changes the atomic offset?
What happens if I don't use endl at all?

Ok, I've found the code that implements this, and it seems it indeed writes data to global buffer.
Still wonder what happens if we don't use endl.

The reason I ask is because you've replaced all "\n" with endl. Can you keep "\n" at least in some cases to test it?

I have added separate test for this. Problem is that commands cleanup is not implemented in scheduler yet (there is a TODO), this is the reason why stream is not flushed on host device for this test case, I have disabled this test for host and added todo.

Ok, so a stream should be flushed at the end of a kernel, but it doesn't because of a missing feature elsewhere. Output that is not followed by endl gets lost. Is this accurate?
If so, then make sure to let users know about this issue. It can be totally unexpected if you assume that SYCL stream behaves similarly to std::cout.

Yes, you are right, for the host device output that is not followed by endl gets lost.

Described this in commit message and added note to the header.

asavonic · 2019-10-17T11:52:45Z

sycl/include/CL/sycl/handler.hpp

@@ -286,6 +286,22 @@ class handler {
      case access::target::local: {
        detail::LocalAccessorBaseHost *LAcc =
            static_cast<detail::LocalAccessorBaseHost *>(Ptr);
+        // Stream implementation creates local accessor with size per work item


Any reason to use local memory that is "partitioned" per WIs instead of just using private memory?

We cannot allocate private memory of size which is known only in runtime, and private memory is usually in registers and probably it will be not good to store data there. We would need to make the flush buffer size a template argument.

You are right, thx!

rolandschulz · 2019-10-17T23:36:43Z

sycl/include/CL/sycl/stream.hpp

@@ -327,6 +383,9 @@ inline const stream &operator<<(const stream &Out,
  switch (RHS) {
  case stream_manipulator::endl:
    Out << '\n';
+  case stream_manipulator::flush:


Should this have a LLVM_FALLTHROUGH or similar?

Thx, fixed!

againull · 2019-10-21T10:46:52Z

In the last commit I have addressed the concern from @rolandschulz regarding copying of stream objects. If you prefer separate PR for this then please let me know.

rolandschulz · 2019-10-21T17:02:07Z

sycl/include/CL/sycl/stream.hpp

@@ -104,6 +107,15 @@ class stream {

  bool operator!=(const stream &LHS) const;

+  ~stream() {


It seems you fixed the more important problem that the output is correct if the stream is copied. Did you address my initial comment? It seems the stream class is still causing a flush by calling flushBuffer on each call of the destructor. In the case the destructor of a copy is called this causes an extra flush which isn't documented.

rolandschulz · 2019-10-21T17:02:32Z

sycl/include/CL/sycl/stream.hpp

+    // the host device. Data is not flushed automatically after kernel execution
+    // because of the missing feature in scheduler.
+    auto Offset = getOffsetPtr(FlushBufs, FlushBufferSize, WIOffset);
+    if ((Offset != nullptr) && *Offset)


is this if statement redundant? It seems this is checked already inside flushBuffer.

Thanks, you are right, fixed.

Pool of flush buffers is allocated in local memory. This pool contains space for each work item in the work group. Each work item writes to its own space (flush buffer), as a result output from different work items is not mixed. Data is flushed to global buffer on endl, flush or when kernel execution is finished. Global buffer contains all output from the kernel. Offset of the WI's flush buffer in the pool is calculated only once in __init method. Call to this method is generated by frontend. In the current implementation user should explicitly flush data on the host device. Data is not flushed automatically after kernel execution because of the missing feature in the scheduler. Signed-off-by: Artur Gainullin <[email protected]>

againull · 2019-10-22T07:28:34Z

For some reasons the second commit that resolved issue with copying of stream objects causes faliures on FPGA target, but passes on all other targets. I need some time to investigate this issue. That is why I propose to commit only the first commit that enables atomic output feature in this PR, if you are ok with this. I have described unresolved issues in the commit message and will resolve them in separate PRs. I have tested these changes on all targets and expect that all tests will pass.

asavonic · 2019-10-22T14:46:42Z

sycl/include/CL/sycl/detail/accessor_impl.hpp

+    if (GlobalSize != 1 && LocalSize != 1) {
+      // If local size is not specified then work group size is chosen by
+      // runtime. That is why try to allocate based on max work group size or
+      // global size. In the worst case allocate 80% of local memory.


This function is weird.. Why do we allocate 80% of local size? In what situation we can exceed 80% of local size and what happens in this case? If we lose some data, can a perfectly legal SYCL code hit this?

Because I think it is not good to allocate 100% of local mem, then it will not be possible to use local mem for other purposes inside the kernel. If [# work items running in parallel] * [flush buffer size provided by user] exceeds 80% of local memory then data for work items that don't have flush buffer will not be printed. But as far as I know maximum number of work items in work group for each device is defined, so this number could not be too big. Theoretically user can provide any flush buffer size, but I am not sure that any implementation can handle any input provided by user. Probably we can write an error message if user provided too big flush buffer size ( if allocated local memory is not enough).

This design (to use local memory) was discussed with @rolandschulz and @dbabokin.
@rolandschulz @dbabokin Could you please also provide your input regarding this question?

Can you defer this computation somehow till after the run-time has chosen the wg-size? Or do we always choose max WG-size/GlobalSize so this matches?

We should throw an exception if the flush buffer is too big.

asavonic · 2019-10-22T14:53:35Z

sycl/include/CL/sycl/detail/stream_impl.hpp

@@ -116,7 +135,7 @@ class stream_impl {

  // Maximum number of symbols which could be streamed from the beginning of a
  // statement till the semicolon
-  size_t MaxStatementSize_;
+  unsigned MaxStatementSize_;


I understand that it's pretty unlikely that someone is going to push more than 4GB to a SYCL stream, but what is the reason for this change? Is it possible to overflow?

I used unsigned type for offsets in the code. I can change this.

asavonic · 2019-10-22T14:56:01Z

sycl/include/CL/sycl/detail/stream_impl.hpp

+                  const char *Str, unsigned Len, unsigned Padding = 0) {
+  if ((FlushBufferSize - Offset < Len + Padding) ||
+      (WIOffset + Offset + Len + Padding > FlushBufs.get_count()))
+    // TODO: flush here


Do you plan to address/remove this todo?

Yes, I planned to address this in further commits.

asavonic · 2019-10-22T15:01:51Z

sycl/include/CL/sycl/stream.hpp

+    // buffer.
+    FlushBufferSize = FlushSize[0].load();
+    WIOffsetAcc[0].store(0);
+    detail::workGroupBarrier();


Is it possible to have WIOffsetAcc zero-initialized and get rid of the barrier?

I haven't found possibility to do so. Please share with me if you have any ideas

So the WIOffsetAcc is a local pointer, and each WI increments it by FlushBufferSize to determine offset for this WI, right? Why not just compute the offset based on a WI id like this: WIOffset = get_local_id() * FlushBufferSize?

Yes, correct. I was not able to use this approach WIOffset = get_local_id() * FlushBufferSize because I need instance of sycl::id object to call get_local_id.

I need instance of sycl::id object to call get_local_id.

You can #include <CL/__spirv/spirv_vars.hpp> and use __spirv_BuiltInLocalInvocationId variables declared there from any function you need. They are lowered to the same SPIR-V as the OpenCL get_local_id().

I didn't know that, this makes sense, thx. Then I will prepare follow up improvements/fixes for committed patch.

Replaced usages of "VectorType::getNumElements" with "FixedVectorType::getNumElements" Signed-off-by: amochalo <[email protected]>

* upstream/sycl: (405 commits) [SYCL] Implement new env var SYCL_DEVICE_FILTER (intel#2239) [Driver][SYCL] Make /MD the default for -fsycl (intel#2478) [SYCL]: basic support of contexts with multiple devices in Level-Zero (intel#2440) [SYCL] Fix LIT regression after 9dd18ca (intel#2481) [SYCL][L0] Kernel Destroy in piKernelRelease (intel#2475) [SYCL] Emit an aliased function only if it is used (intel#2430) [Driver][SYCL] Add defaultlib directive for sycl lib (intel#2464) [Driver][SYCL] Improve situations where .exe is added for AOT tools (intel#2467) [SYCL][L0]: Check Queue refcnt prior to using members in event wait/release (intel#2471) [SYCL] Unroll several loops in __init method accessor class (intel#2449) [SYCL][Doc] Add link to use pinned memory spec (intel#2463) [SYCL] Link SYCL device libraries by default. (intel#2400) Revert "[SYCL] XFAIL test blcoking pulldown" Avoid usage of deprecated "VectorType::getNumElements" (intel#737) Fix nullptr dereference (intel#741) Do not translate arbitrary precision operations without corresponding extensions (intel#714) Add Constrained Floating-Point Intrinsics support [SYCL] Take into account auxiliary cmake options for Level Zero loader [InstCombine] improve fold of pointer differences [InstCombine] add ptr difference tests; NFC ...

againull requested a review from romanovvlad October 16, 2019 04:42

againull assigned romanovvlad Oct 16, 2019

rolandschulz reviewed Oct 16, 2019

View reviewed changes

mlychkov reviewed Oct 16, 2019

View reviewed changes

romanovvlad assigned asavonic and unassigned romanovvlad Oct 16, 2019

romanovvlad requested a review from asavonic October 16, 2019 11:33

asavonic reviewed Oct 17, 2019

View reviewed changes

rolandschulz reviewed Oct 17, 2019

View reviewed changes

againull force-pushed the atomic_output branch 4 times, most recently from e03e721 to 1abb047 Compare October 21, 2019 10:45

rolandschulz reviewed Oct 21, 2019

View reviewed changes

againull force-pushed the atomic_output branch from 1abb047 to 08b2b59 Compare October 21, 2019 17:32

mkinsner previously approved these changes Oct 21, 2019

View reviewed changes

againull dismissed mkinsner’s stale review via d45acd8 October 22, 2019 07:24

againull force-pushed the atomic_output branch from 08b2b59 to d45acd8 Compare October 22, 2019 07:24

asavonic reviewed Oct 22, 2019

View reviewed changes

mkinsner approved these changes Oct 22, 2019

View reviewed changes

pvchupin merged commit 377b3fa into intel:sycl Oct 22, 2019

againull deleted the atomic_output branch December 3, 2022 00:02

		@@ -104,6 +107,15 @@ class stream {

		bool operator!=(const stream &LHS) const;

		~stream() {

		MNDRDesc.GlobalSize.size() != 1 &&
		MNDRDesc.GlobalSize.size() != 1) {

[SYCL] Enable useful (not random) output from stream #737

[SYCL] Enable useful (not random) output from stream #737

Conversation

againull commented Oct 16, 2019

rolandschulz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

againull Oct 18, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

asavonic Oct 17, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

agozillon Oct 17, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

againull commented Oct 21, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

againull commented Oct 22, 2019 • edited Loading

Choose a reason for hiding this comment

againull Oct 22, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

asavonic Oct 23, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

againull Oct 18, 2019 •

edited

Loading

asavonic Oct 17, 2019 •

edited

Loading

agozillon Oct 17, 2019 •

edited

Loading

againull commented Oct 21, 2019 •

edited

Loading

againull commented Oct 22, 2019 •

edited

Loading

againull Oct 22, 2019 •

edited

Loading

asavonic Oct 23, 2019 •

edited

Loading