[SYCL][CUDA] Add basic sub-group functionality #2587
Conversation
Mirrors the approach taken for work-group built-ins in libclc. Also removes the unused EnqueuedNumSubgroups built-in.
Signed-off-by: John Pennycook <[email protected]>
Implements:
- Sub-group local id
- Sub-group id
- Number of sub-groups
- Sub-group size
- Max sub-group size

The implementations are functionally correct, but may benefit from additional optimization.

Signed-off-by: John Pennycook <[email protected]>
Tagging @Naghasan.
Yes, I also stumbled across that - that's also how it works in hipSYCL. Additionally, my understanding is that the laneid register may be less performant than computing the sub-group local id yourself (which is what hipSYCL does), see e.g. https://stackoverflow.com/questions/44337309/whats-the-most-efficient-way-to-calculate-the-warp-id-lane-id-in-a-1-d-grid
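For reference, a minimal CUDA sketch of the two approaches being compared here: reading the %laneid special register versus deriving the lane from the linearized thread index. The function names are illustrative only, not the actual libclc or hipSYCL entry points.

```cuda
// Option 1: read the %laneid special register via inline PTX.
__device__ unsigned lane_from_sreg() {
  unsigned lane;
  asm("mov.u32 %0, %%laneid;" : "=r"(lane));
  return lane;
}

// Option 2: derive the lane from the linearized thread index,
// which is the approach hipSYCL reportedly takes.
__device__ unsigned lane_from_tid() {
  unsigned tid = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
  return tid % warpSize;  // warpSize is 32 on current NVIDIA hardware
}
```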
Thanks, @illuhad.
We should definitely revisit this if benchmarks suggest that manually computing the lane id is always better. I found some references to special registers potentially having high latency, but nothing about the relative cost of different special registers. Since we'd need to read the
* sycl:
  - [SYCL] Fix test failure due to size discrepancy in 32 and 64 bit environment (intel#2594)
  - [BuildBot] Linux GPU driver uplift to 20.39.17972 (intel#2600)
  - [SYCL][NFC] Remove cyclic dependency in headers (intel#2601)
  - [SYCL][Doc] Fix Sphinx docs index (intel#2599)
  - [SYCL][PI][L0] Add an assert to piEnqueueKernelLaunch() when the GlobalWorkOffset!=0 (intel#2593)
  - [SYCL][Driver] Turn on -spirv-allow-extra-diexpressions option (intel#2578)
  - [SYCL] Serialize queue related PI API in the Level-Zero plugin (intel#2588)
  - Added missing math APIs for devicelib. (intel#2558)
  - [SYCL][DOC] Fix path to FPGA device selector (intel#2563)
  - [SYCL][CUDA] Add basic sub-group functionality (intel#2587)
  - [SYCL] Add specialization constant feature design doc. (intel#2572)
  - [SYCL] Expand release lit test to CPU and ACC (intel#2589)
  - [SYCL] Add clang support for FPGA kernel attribute scheduler_target_fmax_mhz (intel#2511)
  - [SYCL][ESIMD] Fixed compiler crash in LowerESIMDVecArg pass (intel#2556)
The patch adds a representation of the CacheControls(Load/Store)INTEL decorations as metadata placed on the memory-accessing instruction, with the following form:

!spirv.DecorationCacheControlINTEL !X
!X = !{i32 %decoration_kind%, i32 %level%, i32 %control%, i32 %operand of the instruction to decorate%}

The patch also creates a dummy GEP accessing the pointer operand of the instruction, to perform an SSA copy of the pointer, and attaches the !spirv.Decorations metadata to that GEP.

A few notes about this implementation and other options. It is sub-optimal, as it requires iterating over all instructions in the module. Alternatives are:
1. In SPIRVWriter, during transDecorations, rewrite the already translated instruction (say, a load) with a new one that takes the GEP as its operand. While the translator provides an API for erasing instructions and rewriting SPIRVBasicBlocks, it is unfortunately not really usable outside of the SPIRVModule context.
2. In SPIRVWriter, during transValueWithoutDecoration, add special handling for instructions carrying spirv.DecorationCacheControlINTEL metadata. This would work with one exception: it would also require exposing some private methods of SPIRVScavenger to create a GEP of the correct type, which I'm not sure is a good idea.

Note that with this implementation, in the worst case SPIRVScavenger would create a bitcast later during translation.

Signed-off-by: Sidorov, Dmitry <[email protected]>
Original commit: KhronosGroup/SPIRV-LLVM-Translator@c8bfc33ef8f9f27
Implements:
- Sub-group local id
- Sub-group id
- Number of sub-groups
- Sub-group size
- Max sub-group size

The implementations are functionally correct, but may benefit from additional optimization.

Signed-off-by: John Pennycook [email protected]
The implementation is different from the one proposed in https://intel.github.io/llvm-docs/cuda/opencl-subgroup-vs-cuda-crosslane-op.html, because I don't think `sreg.warpid` and `sreg.nwarpid` have the correct semantics for sub-groups. The mapping from work-items to sub-groups is invariant during a kernel's execution, which isn't true of the warp ID in PTX. As far as I can tell, the number of warp IDs represents the maximum number of warps that can execute in a CTA, rather than the number of warps in a CTA.

NVIDIA's PTX documentation says that `tid` should be used to compute a "virtual warp ID" if one is required, which is what I've implemented. I convinced myself that we have to compute the sub-group IDs and sizes from the linear size of the work-group, and couldn't find a simpler way to express this.

Ideally, we wouldn't have to re-compute each of these values on every call. It would be sufficient to compute them once at the start of the kernel and then re-use them, but I don't have enough knowledge of Clang/LLVM/libclc to implement that.
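A minimal CUDA sketch of the virtual-warp computation described above; the helper names are illustrative, not the actual libclc entry points, and warpSize stands in for the maximum sub-group size.

```cuda
// Linearize the thread index and the work-group size, then derive the
// sub-group quantities from those two values.
__device__ unsigned linear_id() {
  return threadIdx.x
       + threadIdx.y * blockDim.x
       + threadIdx.z * blockDim.x * blockDim.y;
}

__device__ unsigned linear_size() {
  return blockDim.x * blockDim.y * blockDim.z;
}

__device__ unsigned sub_group_local_id() { return linear_id() % warpSize; }
__device__ unsigned sub_group_id()       { return linear_id() / warpSize; }

__device__ unsigned num_sub_groups() {
  return (linear_size() + warpSize - 1) / warpSize;  // round up
}

__device__ unsigned max_sub_group_size() { return warpSize; }

__device__ unsigned sub_group_size() {
  // The last sub-group is smaller when the work-group size is not a
  // multiple of the warp size.
  unsigned rem = linear_size() % warpSize;
  return (rem != 0 && sub_group_id() == num_sub_groups() - 1) ? rem : warpSize;
}
```

As the description notes, these values are invariant for the lifetime of the kernel, so ideally they would be computed once at kernel start and reused rather than recomputed on every call.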