Skip to content

[SYCL] Enable useful (not random) output from stream #737

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 22, 2019

Conversation

againull
Copy link
Contributor

Pool of flush buffers is allocated in local memory. This pool contains
space for each work item in the work group. Each work item writes to
its own space (flush buffer), as a result output from different work
items is not mixed. Data is flushed to global buffer on endl, flush or
when kernel execution is finished. Global buffer contains all output
from the kernel. Offset of the WI's flush buffer in the pool is
calculated only once in __init method. Call to this method is generated
by frontend.

Signed-off-by: Artur Gainullin [email protected]

Copy link
Contributor

@rolandschulz rolandschulz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only took a very quick look. Thanks for getting this done so quickly!

@@ -104,6 +107,15 @@ class stream {

bool operator!=(const stream &LHS) const;

~stream() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stream class is copyable. I don't think we want to flush if a copy of the stream class is deleted. Only when the last one is deleted / the kernel finishes should we flush.

Copy link
Contributor Author

@againull againull Oct 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, thx. Now everything is ok if stream is copied on the host side. But it is more tricky if stream is copied in the device code, because now I use a field of the stream object to save offset in flush buffer for work item. So 2 objects (with their own offset in same flush buffer) could write to same local space. It looks like I need to use local mem for offset, for exmaple the first several bytes of each flush buffer can store offset in the flush buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you fixed the more important problem that the output is correct if the stream is copied. Did you address my initial comment? It seems the stream class is still causing a flush by calling flushBuffer on each call of the destructor. In the case the destructor of a copy is called this causes an extra flush which isn't documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that I have addressed this problem. I do flush only if Offset is not zero, i.e. if there is some data to flush. Offset is shared between copies of the stream object. So if some of the copies flushed data then other copy will not flush because Offset will be 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree you won't flush multiple times if all destructors are called at the same time because they all are in the same scope. But if you have:

{ //begin of kernel
s << "a";
{ 
    auto s2 = s;
    s2 << "b";
} //destructor of s2
s << "c";
}  // end of kernel. destructor of s

The current implementation will cause an extra flush between b and c. The example is contrived. But if you e.g. pass the stream by value to a function you would have the same at the end of the function scope.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't blocking. But needs to be addressed at some point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, thank you for the example. Then I will address this in separate commit.

Comment on lines 294 to 295
MNDRDesc.GlobalSize.size() != 1 &&
MNDRDesc.GlobalSize.size() != 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there is typo in one of the lines?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, fixed.

@romanovvlad romanovvlad assigned asavonic and unassigned romanovvlad Oct 16, 2019
@romanovvlad romanovvlad requested a review from asavonic October 16, 2019 11:33
// Overflow
return false;
New = Cur + Size;
} while (!OffsetAcc[0].compare_exchange_strong(Cur, New));
} while (!GlobalOffset[0].compare_exchange_strong(Cur, New));
Copy link
Contributor

@asavonic asavonic Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare_exchange_weak should be enough.
Edit: people told me that there is no weak CAS. Please disregard this comment.

MNDRDesc.GlobalSize.size() != 1 &&
MNDRDesc.GlobalSize.size() != 1) {
auto LocalAccImpl = detail::getSyclObjImpl(*LAcc);
// If local size is not speicified then consider that there is one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if user submits a kernel with 1M work-items, and let the RT decide the number of WI in a WG, we're going to allocate local memory as if there were 1M work-items in a workgroup?
That seems not very scalable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this part. As you said, if user doesn't provide WG size then RT chooses number of WIs in WG. SO we will know this number only during kernel execution in __init. So if WG size is not provided by user, I use global size or info about max WG size for device to allocate local memory (take the minimum). In the worst case decided to allocate 80% of local mem.

// Offset in the flush buffer
mutable unsigned Offset = 0;

mutable size_t FlushBufferSize = 200;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 200?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just used this constant during development for testing purposes, forgot to remove before uploading. Thanks, fixed.

@@ -208,7 +208,7 @@ int main() {
Queue.submit([&](handler &CGH) {
stream Out(1024, 80, CGH);
CGH.parallel_for<class stream_string>(
range<1>(10), [=](id<1> i) { Out << "Hello, World!\n"; });
range<1>(10), [=](id<1> i) { Out << "Hello, World!" << endl; });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference here?

Copy link
Contributor

@agozillon agozillon Oct 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::endl usually flushes after adding a newline, this commit looks like it adds a fall-through to a flush for the SYCL implementation of endl to make it equivalent. So it's making sure it flushes and adds a newline now if I understand the change to the stream << operator correctly!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, thx!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. So what this "stream endl" implementation actually does to flush? Does it write data to the global buffer and changes the atomic offset?
What happens if I don't use endl at all?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've found the code that implements this, and it seems it indeed writes data to global buffer.
Still wonder what happens if we don't use endl.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I ask is because you've replaced all "\n" with endl. Can you keep "\n" at least in some cases to test it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added separate test for this. Problem is that commands cleanup is not implemented in scheduler yet (there is a TODO), this is the reason why stream is not flushed on host device for this test case, I have disabled this test for host and added todo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so a stream should be flushed at the end of a kernel, but it doesn't because of a missing feature elsewhere. Output that is not followed by endl gets lost. Is this accurate?
If so, then make sure to let users know about this issue. It can be totally unexpected if you assume that SYCL stream behaves similarly to std::cout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, for the host device output that is not followed by endl gets lost.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Described this in commit message and added note to the header.

@@ -286,6 +286,22 @@ class handler {
case access::target::local: {
detail::LocalAccessorBaseHost *LAcc =
static_cast<detail::LocalAccessorBaseHost *>(Ptr);
// Stream implementation creates local accessor with size per work item
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to use local memory that is "partitioned" per WIs instead of just using private memory?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot allocate private memory of size which is known only in runtime, and private memory is usually in registers and probably it will be not good to store data there. We would need to make the flush buffer size a template argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, thx!

@@ -327,6 +383,9 @@ inline const stream &operator<<(const stream &Out,
switch (RHS) {
case stream_manipulator::endl:
Out << '\n';
case stream_manipulator::flush:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have a LLVM_FALLTHROUGH or similar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, fixed!

@againull againull force-pushed the atomic_output branch 4 times, most recently from e03e721 to 1abb047 Compare October 21, 2019 10:45
@againull
Copy link
Contributor Author

againull commented Oct 21, 2019

In the last commit I have addressed the concern from @rolandschulz regarding copying of stream objects. If you prefer separate PR for this then please let me know.

@@ -104,6 +107,15 @@ class stream {

bool operator!=(const stream &LHS) const;

~stream() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you fixed the more important problem that the output is correct if the stream is copied. Did you address my initial comment? It seems the stream class is still causing a flush by calling flushBuffer on each call of the destructor. In the case the destructor of a copy is called this causes an extra flush which isn't documented.

// the host device. Data is not flushed automatically after kernel execution
// because of the missing feature in scheduler.
auto Offset = getOffsetPtr(FlushBufs, FlushBufferSize, WIOffset);
if ((Offset != nullptr) && *Offset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this if statement redundant? It seems this is checked already inside flushBuffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, you are right, fixed.

mkinsner
mkinsner previously approved these changes Oct 21, 2019
Pool of flush buffers is allocated in local memory. This pool contains
space for each work item in the work group. Each work item writes to
its own space (flush buffer), as a result output from different work
items is not mixed. Data is flushed to global buffer on endl, flush or
when kernel execution is finished.  Global buffer contains all output
from the kernel. Offset of the WI's flush buffer in the pool is
calculated only once in __init method. Call to this method is generated
by frontend.

In the current implementation user should explicitly flush data on the
host device. Data is not flushed automatically after kernel execution
because of the missing feature in the scheduler.

Signed-off-by: Artur Gainullin <[email protected]>
@againull
Copy link
Contributor Author

againull commented Oct 22, 2019

For some reasons the second commit that resolved issue with copying of stream objects causes faliures on FPGA target, but passes on all other targets. I need some time to investigate this issue. That is why I propose to commit only the first commit that enables atomic output feature in this PR, if you are ok with this. I have described unresolved issues in the commit message and will resolve them in separate PRs. I have tested these changes on all targets and expect that all tests will pass.

if (GlobalSize != 1 && LocalSize != 1) {
// If local size is not specified then work group size is chosen by
// runtime. That is why try to allocate based on max work group size or
// global size. In the worst case allocate 80% of local memory.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is weird.. Why do we allocate 80% of local size? In what situation we can exceed 80% of local size and what happens in this case? If we lose some data, can a perfectly legal SYCL code hit this?

Copy link
Contributor Author

@againull againull Oct 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I think it is not good to allocate 100% of local mem, then it will not be possible to use local mem for other purposes inside the kernel. If [# work items running in parallel] * [flush buffer size provided by user] exceeds 80% of local memory then data for work items that don't have flush buffer will not be printed. But as far as I know maximum number of work items in work group for each device is defined, so this number could not be too big. Theoretically user can provide any flush buffer size, but I am not sure that any implementation can handle any input provided by user. Probably we can write an error message if user provided too big flush buffer size ( if allocated local memory is not enough).

This design (to use local memory) was discussed with @rolandschulz and @dbabokin.
@rolandschulz @dbabokin Could you please also provide your input regarding this question?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you defer this computation somehow till after the run-time has chosen the wg-size? Or do we always choose max WG-size/GlobalSize so this matches?

We should throw an exception if the flush buffer is too big.

@@ -116,7 +135,7 @@ class stream_impl {

// Maximum number of symbols which could be streamed from the beginning of a
// statement till the semicolon
size_t MaxStatementSize_;
unsigned MaxStatementSize_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that it's pretty unlikely that someone is going to push more than 4GB to a SYCL stream, but what is the reason for this change? Is it possible to overflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used unsigned type for offsets in the code. I can change this.

const char *Str, unsigned Len, unsigned Padding = 0) {
if ((FlushBufferSize - Offset < Len + Padding) ||
(WIOffset + Offset + Len + Padding > FlushBufs.get_count()))
// TODO: flush here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to address/remove this todo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I planned to address this in further commits.

// buffer.
FlushBufferSize = FlushSize[0].load();
WIOffsetAcc[0].store(0);
detail::workGroupBarrier();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have WIOffsetAcc zero-initialized and get rid of the barrier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't found possibility to do so. Please share with me if you have any ideas

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the WIOffsetAcc is a local pointer, and each WI increments it by FlushBufferSize to determine offset for this WI, right? Why not just compute the offset based on a WI id like this: WIOffset = get_local_id() * FlushBufferSize?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, correct. I was not able to use this approach WIOffset = get_local_id() * FlushBufferSize because I need instance of sycl::id object to call get_local_id.

Copy link
Contributor

@asavonic asavonic Oct 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need instance of sycl::id object to call get_local_id.

You can #include <CL/__spirv/spirv_vars.hpp> and use __spirv_BuiltInLocalInvocationId variables declared there from any function you need. They are lowered to the same SPIR-V as the OpenCL get_local_id().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know that, this makes sense, thx. Then I will prepare follow up improvements/fixes for committed patch.

@pvchupin pvchupin merged commit 377b3fa into intel:sycl Oct 22, 2019
vladimirlaz pushed a commit to vladimirlaz/llvm that referenced this pull request Sep 13, 2020
Replaced usages of "VectorType::getNumElements" with "FixedVectorType::getNumElements"

Signed-off-by: amochalo <[email protected]>
alexbatashev pushed a commit to alexbatashev/llvm that referenced this pull request Sep 17, 2020
* upstream/sycl: (405 commits)
  [SYCL] Implement new env var SYCL_DEVICE_FILTER (intel#2239)
  [Driver][SYCL] Make /MD the default for -fsycl (intel#2478)
  [SYCL]: basic support of contexts with multiple devices in Level-Zero (intel#2440)
  [SYCL] Fix LIT regression after 9dd18ca (intel#2481)
  [SYCL][L0] Kernel Destroy in piKernelRelease (intel#2475)
  [SYCL] Emit an aliased function only if it is used (intel#2430)
  [Driver][SYCL] Add defaultlib directive for sycl lib (intel#2464)
  [Driver][SYCL] Improve situations where .exe is added for AOT tools (intel#2467)
  [SYCL][L0]: Check Queue refcnt prior to using members in event wait/release (intel#2471)
  [SYCL] Unroll several loops in __init method accessor class (intel#2449)
  [SYCL][Doc] Add link to use pinned memory spec (intel#2463)
  [SYCL] Link SYCL device libraries by default. (intel#2400)
  Revert "[SYCL] XFAIL test blcoking pulldown"
  Avoid usage of deprecated "VectorType::getNumElements" (intel#737)
  Fix nullptr dereference (intel#741)
  Do not translate arbitrary precision operations without corresponding extensions (intel#714)
  Add Constrained Floating-Point Intrinsics support
  [SYCL] Take into account auxiliary cmake options for Level Zero loader
  [InstCombine] improve fold of pointer differences
  [InstCombine] add ptr difference tests; NFC
  ...
@againull againull deleted the atomic_output branch December 3, 2022 00:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants