Skip to content

[SYCL][ESIMD] Add compile time properties overload of USM block store #11641

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 30, 2023

Conversation

sarnex
Copy link
Contributor

@sarnex sarnex commented Oct 24, 2023

This change adds the groundwork for adding overloads of the block_store APIs accepting compile time properties. We have 8 overloads total, with various combinations of offset, predicate and simd_view.

@sarnex sarnex temporarily deployed to WindowsCILock October 24, 2023 14:57 — with GitHub Actions Inactive
@sarnex sarnex temporarily deployed to WindowsCILock October 24, 2023 18:03 — with GitHub Actions Inactive
@sarnex sarnex temporarily deployed to WindowsCILock October 25, 2023 15:47 — with GitHub Actions Inactive
@sarnex sarnex temporarily deployed to WindowsCILock October 25, 2023 16:23 — with GitHub Actions Inactive
@sarnex sarnex temporarily deployed to WindowsCILock October 26, 2023 17:02 — with GitHub Actions Inactive
@sarnex sarnex temporarily deployed to WindowsCILock October 26, 2023 17:07 — with GitHub Actions Inactive
@sarnex sarnex temporarily deployed to WindowsCILock October 26, 2023 17:14 — with GitHub Actions Inactive
@sarnex sarnex temporarily deployed to WindowsCILock October 26, 2023 17:48 — with GitHub Actions Inactive
testUSM<T, 33, !CheckMask, CheckProperties>(Q, 2, 4, AlignOnlyProps);
// TODO: Enable after failure fixed
// Passed &=
// testUSM<T, 67, !CheckMask, CheckProperties>(Q, 1, 4, AlignOnlyProps);
Copy link
Contributor Author

@sarnex sarnex Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case fails even when using the old API, I reproduced it in a standalone test case. I wanted to see what happens before we moved to intrinsics, but we actually assert that the size is a multiple of 16, so we couldn't do it in the old way. I made an internal tracker for this, it should be unrelated to this PR, it just exposed the test case. Maybe the test is wrong and only this case exposes it, but I don't see where.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first guess is that GPU BE lowers LLVM IR store <T x 67> incorrectly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my guess too but I did not have enough courage to say it :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to analyse/check it on our side first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, I have an internal tracker assigned to me for this.

@sarnex sarnex marked this pull request as ready for review October 26, 2023 20:22
@sarnex sarnex requested a review from a team as a code owner October 26, 2023 20:22
Copy link
Contributor

@v-klochkov v-klochkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. I have several comments - all of the are minor.

testUSM<T, 33, !CheckMask, CheckProperties>(Q, 2, 4, AlignOnlyProps);
// TODO: Enable after failure fixed
// Passed &=
// testUSM<T, 67, !CheckMask, CheckProperties>(Q, 1, 4, AlignOnlyProps);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first guess is that GPU BE lowers LLVM IR store <T x 67> incorrectly.

Comment on lines 1393 to 1396
/// Alignment: If \p props does not specify the 'alignment' property, then
/// the default assumed alignment is the minimally required element-size
/// alignment. Note that additional/temporary restrictions may apply
/// (see Restrictions below).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Major concern here: if apply the statements here, then the new block_store(usm_ptr, value) call may work slower than the old one because on Gen12 STORE requires 16-bytes alignment to produce block_store operation. Otherwise, scatter will be generated.

Assuming 4-byte alignment for store of simd<int, N> is valid, but will be a regression.
I think we need to raise the expected alignment to 16-bytes if cache-hints are not passed and predicate is not used. That will be consistent with the old block_store that had the default alignment = overaligned_tagdetail::OperandSize::OWORD

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I implemented this in my latest commit and updated comments as necessary, but please review carefully.

Signed-off-by: Sarnie, Nick <[email protected]>
@sarnex sarnex temporarily deployed to WindowsCILock October 27, 2023 16:09 — with GitHub Actions Inactive
@sarnex sarnex temporarily deployed to WindowsCILock October 27, 2023 16:41 — with GitHub Actions Inactive
@@ -1536,14 +1539,25 @@ block_store(T *ptr, simd<T, N> vals, simd_mask<1> pred,
static_assert(!PropertyListT::template has_property<cache_hint_L3_key>(),
"L3 cache hint is reserved. The old/experimental L3 LSC cache "
"hint is cache_level::L2 now.");
bool ShouldUseOWordDefaultAlign =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we should not having this dynamic check/jump and generating 2 different versions of block-load for 1 user's block_load() call.
Let's say that because this variant of the function (accepting a predicate) is called, we use the DG2/PVC restrictions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah good feedback, i wasnt sure which direction to go, thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in latest commit hopefully

Signed-off-by: Sarnie, Nick <[email protected]>
@sarnex sarnex temporarily deployed to WindowsCILock October 27, 2023 17:45 — with GitHub Actions Inactive
@sarnex sarnex temporarily deployed to WindowsCILock October 27, 2023 18:58 — with GitHub Actions Inactive
@sarnex sarnex requested a review from v-klochkov October 30, 2023 15:05
@v-klochkov v-klochkov merged commit d38206c into intel:sycl Oct 30, 2023
simd<uint32_t, N> PassThruInt(ElemOff, 1);
simd<T, N> Vals = PassThruInt;
if constexpr (UseMask) {
simd_mask<1> Mask = (GlobalID + 1) % 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Nick,
I just found an error in block_load.hpp, and will fix it soon and it seems you copy-pasted it to this block_store.hpp.
The code that was supposed to be here and in few other places is: simd_mask<1> Mask = (GlobalID + 1) & 0x1;

Copy link
Contributor

@v-klochkov v-klochkov Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Val % 1) always gives 0.
Can you please fix it in block_store.hpp file and test if your patch still works correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do this now, thanks for the heads up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Luckily it passes, I'm making a PR now

maarquitos14 pushed a commit that referenced this pull request Oct 31, 2023
…11641)

This change adds the groundwork for adding overloads of the block_store
APIs accepting compile time properties (L1,L2 cache hints, alignment).
We have 8 overloads total, with various combinations of offset, predicate and simd_view.

---------

Signed-off-by: Sarnie, Nick <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants