
[SYCL][CUDA] Add basic sub-group functionality #2587


Merged
merged 5 commits into intel:sycl from cuda-sub-groups on Oct 5, 2020

Conversation

Pennycook
Contributor

Implements:

  • Sub-group local id
  • Sub-group id
  • Number of sub-groups
  • Sub-group size
  • Max sub-group size

The implementations are functionally correct, but may benefit from
additional optimization.

Signed-off-by: John Pennycook [email protected]


The implementation differs from the one proposed in https://intel.github.io/llvm-docs/cuda/opencl-subgroup-vs-cuda-crosslane-op.html, because I don't think the %warpid and %nwarpid special registers have the correct semantics for sub-groups. The mapping from work-items to sub-groups must be invariant for the duration of a kernel's execution, which isn't true of the warp ID in PTX. As far as I can tell, %nwarpid reports the maximum number of warps that can execute in a CTA rather than the number of warps actually in a CTA.

NVIDIA's PTX documentation says that tid should be used to compute a "virtual warp ID" if one is required, which is what I've implemented. I convinced myself that we have to compute the sub-group IDs and sizes from the linear size of the work-group, and couldn't find a simpler way to express this.

Ideally, we wouldn't have to re-compute each of these values on every call. It would be sufficient to compute them once at the start of the kernel and then re-use them, but I don't have enough knowledge of Clang/LLVM/libclc to implement that.
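
For illustration, the computation amounts to something like the following CUDA-style sketch (a hypothetical restatement assuming a fixed warp size of 32, not the actual libclc code):

```cuda
// Hypothetical sketch: deriving the sub-group queries from the linearized
// thread ID and the linear work-group (CTA) size, assuming a warp size of 32.
__device__ unsigned linear_id() {
  return threadIdx.x
       + threadIdx.y * blockDim.x
       + threadIdx.z * blockDim.x * blockDim.y;
}

__device__ unsigned linear_size() {
  return blockDim.x * blockDim.y * blockDim.z;
}

__device__ unsigned sub_group_local_id() { return linear_id() % 32; }
__device__ unsigned sub_group_id()       { return linear_id() / 32; }

// Number of sub-groups: linear work-group size rounded up to a multiple of 32.
__device__ unsigned num_sub_groups() { return (linear_size() + 31) / 32; }

// The trailing sub-group is smaller when the work-group size is not a
// multiple of 32.
__device__ unsigned sub_group_size() {
  unsigned remaining = linear_size() - sub_group_id() * 32;
  return remaining < 32 ? remaining : 32;
}

// Assumption: the maximum sub-group size is the warp size, clamped to the
// work-group size for very small work-groups.
__device__ unsigned max_sub_group_size() {
  return linear_size() < 32 ? linear_size() : 32;
}
```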

Mirrors the approach taken for the work-group built-ins in libclc.
Also removes the unused EnqueuedNumSubgroups built-in.

Signed-off-by: John Pennycook <[email protected]>
@Pennycook Pennycook added labels: enhancement (New feature or request), spec extension (All issues/PRs related to extensions specifications), cuda (CUDA back-end) on Oct 2, 2020
@Pennycook Pennycook requested review from bader and a team as code owners October 2, 2020 15:55
@bader
Contributor

bader commented Oct 2, 2020

Tagging @Naghasan.

@illuhad

illuhad commented Oct 3, 2020

> NVIDIA's PTX documentation says that tid should be used to compute a "virtual warp ID" if one is required, which is what I've implemented. I convinced myself that we have to compute the sub-group IDs and sizes from the linear size of the work-group, and couldn't find a simpler way to express this.

Yes, I also stumbled across that - that's also how it works in hipSYCL. Additionally, my understanding is that the laneid register may be less performant than computing the subgroup local id yourself (which is what hipSYCL does), see e.g. https://stackoverflow.com/questions/44337309/whats-the-most-efficient-way-to-calculate-the-warp-id-lane-id-in-a-1-d-grid

@bader bader merged commit baed6a5 into intel:sycl Oct 5, 2020
@Pennycook
Contributor Author

Thanks, @illuhad.

> Additionally, my understanding is that the laneid register may be less performant than computing the subgroup local id yourself (which is what hipSYCL does), see e.g. https://stackoverflow.com/questions/44337309/whats-the-most-efficient-way-to-calculate-the-warp-id-lane-id-in-a-1-d-grid

We should definitely revisit this if benchmarks suggest that manually computing the lane id is always better. I found some references to special registers potentially having high latency, but nothing about the relative cost of different special registers. Since we'd need to read the tid and ntid special registers to compute things manually, I wasn't confident that I understood the trade-off here.
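
To make the trade-off concrete, here is a hedged CUDA sketch of the two approaches being compared: reading the %laneid special register via inline PTX versus deriving the lane ID from the thread index. Which is faster is exactly the open benchmarking question; the sketch only shows the shape of the code.

```cuda
// Option 1: read the %laneid special register directly (one special-register
// read, via inline PTX).
__device__ unsigned lane_id_from_register() {
  unsigned lane;
  asm("mov.u32 %0, %%laneid;" : "=r"(lane));
  return lane;
}

// Option 2: derive the lane ID from the thread index. This needs the %tid and
// %ntid special registers plus a little integer arithmetic; for a 1-D block it
// reduces to threadIdx.x & 31. Assumes a warp size of 32.
__device__ unsigned lane_id_from_tid() {
  unsigned linear = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;
  return linear & 31;
}
```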

@Pennycook Pennycook deleted the cuda-sub-groups branch October 5, 2020 20:15
alexbatashev pushed a commit to alexbatashev/llvm that referenced this pull request Oct 7, 2020
* sycl:
  [SYCL] Fix test failure due to size discrepancy in 32 and 64 bit environment (intel#2594)
  [BuildBot] Linux GPU driver uplift to 20.39.17972 (intel#2600)
  [SYCL][NFC] Remove cyclic dependency in headers (intel#2601)
  [SYCL][Doc] Fix Sphinx docs index (intel#2599)
  [SYCL][PI][L0] Add an assert to piEnqueueKernelLaunch() when the GlobalWorkOffset!=0 (intel#2593)
  [SYCL][Driver] Turn on -spirv-allow-extra-diexpressions option (intel#2578)
  [SYCL] Serialize queue related PI API in the Level-Zero plugin (intel#2588)
  Added missing math APIs for devicelib. (intel#2558)
  [SYCL][DOC] Fix path to FPGA device selector (intel#2563)
  [SYCL][CUDA] Add basic sub-group functionality (intel#2587)
  [SYCL] Add specialization constant feature design doc. (intel#2572)
  [SYCL] Expand release lit test to CPU and ACC (intel#2589)
  [SYCL] Add clang support for FPGA kernel attribute scheduler_target_fmax_mhz (intel#2511)
  [SYCL][ESIMD] Fixed compiler crash in LowerESIMDVecArg pass (intel#2556)
KseniyaTikhomirova added a commit to KseniyaTikhomirova/llvm that referenced this pull request Mar 9, 2023
Signed-off-by: Tikhomirova, Kseniya <[email protected]>
jsji pushed a commit that referenced this pull request May 30, 2024
The patch adds a representation of the CacheControls(Load/Store)INTEL decorations
as metadata placed on the memory-accessing instruction, in the following form:
!spirv.DecorationCacheControlINTEL !X
!X = !{i32 %decoration_kind%, i32 %level%, i32 %control%,
       i32 %operand of the instruction to decorate%}

The patch also creates a dummy GEP accessing the pointer operand of the
instruction to produce an SSA copy of the pointer, and attaches the
!spirv.Decorations metadata to that GEP.

A few notes about this implementation and the alternatives. It is sub-optimal
in that it requires iterating over all instructions in the module.
The alternatives are:
1. In SPIRVWriter, during transDecorations, rewrite the already-translated
   instruction (say, a load) with a new one that takes the GEP operand.
   However, while the translator provides an API for erasing instructions and
   rewriting SPIRVBasicBlocks, it is unfortunately not really usable
   outside of the SPIRVModule context;
2. In SPIRVWriter, during transValueWithoutDecoration, add special
   handling for instructions carrying spirv.DecorationCacheControlINTEL
   metadata. This would work, with one exception: it would also require
   un-privatizing some methods of SPIRVScavenger to create the GEP
   with the correct type, which I'm not sure is a good idea.
   Note that with this implementation, in the worst case
   SPIRVScavenger would create a bitcast later during translation.

Signed-off-by: Sidorov, Dmitry <[email protected]>

Original commit:
KhronosGroup/SPIRV-LLVM-Translator@c8bfc33ef8f9f27