-
Notifications
You must be signed in to change notification settings - Fork 769
[SYCL] Enable useful (not random) output from stream #737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only took a very quick look. Thanks for getting this done so quickly!
@@ -104,6 +107,15 @@ class stream { | |||
|
|||
bool operator!=(const stream &LHS) const; | |||
|
|||
~stream() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stream class is copyable. I don't think we want to flush if a copy of the stream class is deleted. Only when the last one is deleted / the kernel finishes should we flush.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, thx. Now everything is ok if stream is copied on the host side. But it is more tricky if stream is copied in the device code, because now I use a field of the stream object to save offset in flush buffer for work item. So 2 objects (with their own offset in same flush buffer) could write to same local space. It looks like I need to use local mem for offset, for exmaple the first several bytes of each flush buffer can store offset in the flush buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you fixed the more important problem that the output is correct if the stream is copied. Did you address my initial comment? It seems the stream class is still causing a flush by calling flushBuffer on each call of the destructor. In the case the destructor of a copy is called this causes an extra flush which isn't documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that I have addressed this problem. I do flush only if Offset is not zero, i.e. if there is some data to flush. Offset is shared between copies of the stream object. So if some of the copies flushed data then other copy will not flush because Offset will be 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree you won't flush multiple times if all destructors are called at the same time because they all are in the same scope. But if you have:
{ //begin of kernel
s << "a";
{
auto s2 = s;
s2 << "b";
} //destructor of s2
s << "c";
} // end of kernel. destructor of s
The current implementation will cause an extra flush between b and c. The example is contrived. But if you e.g. pass the stream by value to a function you would have the same at the end of the function scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't blocking. But needs to be addressed at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see, thank you for the example. Then I will address this in separate commit.
sycl/include/CL/sycl/handler.hpp
Outdated
MNDRDesc.GlobalSize.size() != 1 && | ||
MNDRDesc.GlobalSize.size() != 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe there is typo in one of the lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, fixed.
// Overflow | ||
return false; | ||
New = Cur + Size; | ||
} while (!OffsetAcc[0].compare_exchange_strong(Cur, New)); | ||
} while (!GlobalOffset[0].compare_exchange_strong(Cur, New)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compare_exchange_weak
should be enough.
Edit: people told me that there is no weak CAS. Please disregard this comment.
sycl/include/CL/sycl/handler.hpp
Outdated
MNDRDesc.GlobalSize.size() != 1 && | ||
MNDRDesc.GlobalSize.size() != 1) { | ||
auto LocalAccImpl = detail::getSyclObjImpl(*LAcc); | ||
// If local size is not speicified then consider that there is one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if user submits a kernel with 1M work-items, and let the RT decide the number of WI in a WG, we're going to allocate local memory as if there were 1M work-items in a workgroup?
That seems not very scalable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this part. As you said, if user doesn't provide WG size then RT chooses number of WIs in WG. SO we will know this number only during kernel execution in __init. So if WG size is not provided by user, I use global size or info about max WG size for device to allocate local memory (take the minimum). In the worst case decided to allocate 80% of local mem.
sycl/include/CL/sycl/stream.hpp
Outdated
// Offset in the flush buffer | ||
mutable unsigned Offset = 0; | ||
|
||
mutable size_t FlushBufferSize = 200; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 200
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just used this constant during development for testing purposes, forgot to remove before uploading. Thanks, fixed.
@@ -208,7 +208,7 @@ int main() { | |||
Queue.submit([&](handler &CGH) { | |||
stream Out(1024, 80, CGH); | |||
CGH.parallel_for<class stream_string>( | |||
range<1>(10), [=](id<1> i) { Out << "Hello, World!\n"; }); | |||
range<1>(10), [=](id<1> i) { Out << "Hello, World!" << endl; }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::endl usually flushes after adding a newline, this commit looks like it adds a fall-through to a flush for the SYCL implementation of endl to make it equivalent. So it's making sure it flushes and adds a newline now if I understand the change to the stream << operator correctly!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, thx!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. So what this "stream endl" implementation actually does to flush? Does it write data to the global buffer and changes the atomic offset?
What happens if I don't use endl at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've found the code that implements this, and it seems it indeed writes data to global buffer.
Still wonder what happens if we don't use endl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I ask is because you've replaced all "\n" with endl. Can you keep "\n" at least in some cases to test it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added separate test for this. Problem is that commands cleanup is not implemented in scheduler yet (there is a TODO), this is the reason why stream is not flushed on host device for this test case, I have disabled this test for host and added todo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so a stream should be flushed at the end of a kernel, but it doesn't because of a missing feature elsewhere. Output that is not followed by endl
gets lost. Is this accurate?
If so, then make sure to let users know about this issue. It can be totally unexpected if you assume that SYCL stream behaves similarly to std::cout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right, for the host device output that is not followed by endl gets lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Described this in commit message and added note to the header.
@@ -286,6 +286,22 @@ class handler { | |||
case access::target::local: { | |||
detail::LocalAccessorBaseHost *LAcc = | |||
static_cast<detail::LocalAccessorBaseHost *>(Ptr); | |||
// Stream implementation creates local accessor with size per work item |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to use local memory that is "partitioned" per WIs instead of just using private memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot allocate private memory of size which is known only in runtime, and private memory is usually in registers and probably it will be not good to store data there. We would need to make the flush buffer size a template argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, thx!
@@ -327,6 +383,9 @@ inline const stream &operator<<(const stream &Out, | |||
switch (RHS) { | |||
case stream_manipulator::endl: | |||
Out << '\n'; | |||
case stream_manipulator::flush: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this have a LLVM_FALLTHROUGH or similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, fixed!
e03e721
to
1abb047
Compare
In the last commit I have addressed the concern from @rolandschulz regarding copying of stream objects. If you prefer separate PR for this then please let me know. |
@@ -104,6 +107,15 @@ class stream { | |||
|
|||
bool operator!=(const stream &LHS) const; | |||
|
|||
~stream() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you fixed the more important problem that the output is correct if the stream is copied. Did you address my initial comment? It seems the stream class is still causing a flush by calling flushBuffer on each call of the destructor. In the case the destructor of a copy is called this causes an extra flush which isn't documented.
sycl/include/CL/sycl/stream.hpp
Outdated
// the host device. Data is not flushed automatically after kernel execution | ||
// because of the missing feature in scheduler. | ||
auto Offset = getOffsetPtr(FlushBufs, FlushBufferSize, WIOffset); | ||
if ((Offset != nullptr) && *Offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this if statement redundant? It seems this is checked already inside flushBuffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, you are right, fixed.
1abb047
to
08b2b59
Compare
Pool of flush buffers is allocated in local memory. This pool contains space for each work item in the work group. Each work item writes to its own space (flush buffer), as a result output from different work items is not mixed. Data is flushed to global buffer on endl, flush or when kernel execution is finished. Global buffer contains all output from the kernel. Offset of the WI's flush buffer in the pool is calculated only once in __init method. Call to this method is generated by frontend. In the current implementation user should explicitly flush data on the host device. Data is not flushed automatically after kernel execution because of the missing feature in the scheduler. Signed-off-by: Artur Gainullin <[email protected]>
08b2b59
to
d45acd8
Compare
For some reasons the second commit that resolved issue with copying of stream objects causes faliures on FPGA target, but passes on all other targets. I need some time to investigate this issue. That is why I propose to commit only the first commit that enables atomic output feature in this PR, if you are ok with this. I have described unresolved issues in the commit message and will resolve them in separate PRs. I have tested these changes on all targets and expect that all tests will pass. |
if (GlobalSize != 1 && LocalSize != 1) { | ||
// If local size is not specified then work group size is chosen by | ||
// runtime. That is why try to allocate based on max work group size or | ||
// global size. In the worst case allocate 80% of local memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is weird.. Why do we allocate 80% of local size? In what situation we can exceed 80% of local size and what happens in this case? If we lose some data, can a perfectly legal SYCL code hit this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I think it is not good to allocate 100% of local mem, then it will not be possible to use local mem for other purposes inside the kernel. If [# work items running in parallel] * [flush buffer size provided by user] exceeds 80% of local memory then data for work items that don't have flush buffer will not be printed. But as far as I know maximum number of work items in work group for each device is defined, so this number could not be too big. Theoretically user can provide any flush buffer size, but I am not sure that any implementation can handle any input provided by user. Probably we can write an error message if user provided too big flush buffer size ( if allocated local memory is not enough).
This design (to use local memory) was discussed with @rolandschulz and @dbabokin.
@rolandschulz @dbabokin Could you please also provide your input regarding this question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you defer this computation somehow till after the run-time has chosen the wg-size? Or do we always choose max WG-size/GlobalSize so this matches?
We should throw an exception if the flush buffer is too big.
@@ -116,7 +135,7 @@ class stream_impl { | |||
|
|||
// Maximum number of symbols which could be streamed from the beginning of a | |||
// statement till the semicolon | |||
size_t MaxStatementSize_; | |||
unsigned MaxStatementSize_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that it's pretty unlikely that someone is going to push more than 4GB to a SYCL stream, but what is the reason for this change? Is it possible to overflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used unsigned type for offsets in the code. I can change this.
const char *Str, unsigned Len, unsigned Padding = 0) { | ||
if ((FlushBufferSize - Offset < Len + Padding) || | ||
(WIOffset + Offset + Len + Padding > FlushBufs.get_count())) | ||
// TODO: flush here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you plan to address/remove this todo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I planned to address this in further commits.
// buffer. | ||
FlushBufferSize = FlushSize[0].load(); | ||
WIOffsetAcc[0].store(0); | ||
detail::workGroupBarrier(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have WIOffsetAcc
zero-initialized and get rid of the barrier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't found possibility to do so. Please share with me if you have any ideas
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the WIOffsetAcc
is a local pointer, and each WI increments it by FlushBufferSize
to determine offset for this WI, right? Why not just compute the offset based on a WI id like this: WIOffset = get_local_id() * FlushBufferSize
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, correct. I was not able to use this approach WIOffset = get_local_id() * FlushBufferSize because I need instance of sycl::id object to call get_local_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need instance of sycl::id object to call get_local_id.
You can #include <CL/__spirv/spirv_vars.hpp>
and use __spirv_BuiltInLocalInvocationId
variables declared there from any function you need. They are lowered to the same SPIR-V as the OpenCL get_local_id().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know that, this makes sense, thx. Then I will prepare follow up improvements/fixes for committed patch.
Replaced usages of "VectorType::getNumElements" with "FixedVectorType::getNumElements" Signed-off-by: amochalo <[email protected]>
* upstream/sycl: (405 commits) [SYCL] Implement new env var SYCL_DEVICE_FILTER (intel#2239) [Driver][SYCL] Make /MD the default for -fsycl (intel#2478) [SYCL]: basic support of contexts with multiple devices in Level-Zero (intel#2440) [SYCL] Fix LIT regression after 9dd18ca (intel#2481) [SYCL][L0] Kernel Destroy in piKernelRelease (intel#2475) [SYCL] Emit an aliased function only if it is used (intel#2430) [Driver][SYCL] Add defaultlib directive for sycl lib (intel#2464) [Driver][SYCL] Improve situations where .exe is added for AOT tools (intel#2467) [SYCL][L0]: Check Queue refcnt prior to using members in event wait/release (intel#2471) [SYCL] Unroll several loops in __init method accessor class (intel#2449) [SYCL][Doc] Add link to use pinned memory spec (intel#2463) [SYCL] Link SYCL device libraries by default. (intel#2400) Revert "[SYCL] XFAIL test blcoking pulldown" Avoid usage of deprecated "VectorType::getNumElements" (intel#737) Fix nullptr dereference (intel#741) Do not translate arbitrary precision operations without corresponding extensions (intel#714) Add Constrained Floating-Point Intrinsics support [SYCL] Take into account auxiliary cmake options for Level Zero loader [InstCombine] improve fold of pointer differences [InstCombine] add ptr difference tests; NFC ...
Pool of flush buffers is allocated in local memory. This pool contains
space for each work item in the work group. Each work item writes to
its own space (flush buffer), as a result output from different work
items is not mixed. Data is flushed to global buffer on endl, flush or
when kernel execution is finished. Global buffer contains all output
from the kernel. Offset of the WI's flush buffer in the pool is
calculated only once in __init method. Call to this method is generated
by frontend.
Signed-off-by: Artur Gainullin [email protected]