[SYCL][CUDA] Add basic sub-group functionality #2587
Conversation
Mirrors the approach taken for work-group built-ins in libclc. Also removes the unused EnqueuedNumSubgroups built-in.
Signed-off-by: John Pennycook <[email protected]>
Implements:
- Sub-group local id
- Sub-group id
- Number of sub-groups
- Sub-group size
- Max sub-group size

The implementations are functionally correct, but may benefit from additional optimization.

Signed-off-by: John Pennycook <[email protected]>
Tagging @Naghasan.
Yes, I also stumbled across that - that's also how it works in hipSYCL. Additionally, my understanding is that the laneid register may be less performant than computing the sub-group local id yourself (which is what hipSYCL does), see e.g. https://stackoverflow.com/questions/44337309/whats-the-most-efficient-way-to-calculate-the-warp-id-lane-id-in-a-1-d-grid
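For reference, a minimal CUDA sketch of the two approaches being compared here: reading the %laneid special register versus deriving the lane from the linearized thread index. The function names are illustrative only, not the actual libclc or hipSYCL entry points.

```cuda
// Option 1: read the %laneid special register via inline PTX.
__device__ unsigned lane_from_sreg() {
  unsigned lane;
  asm("mov.u32 %0, %%laneid;" : "=r"(lane));
  return lane;
}

// Option 2: derive the lane from the linearized thread index,
// which is the approach hipSYCL reportedly takes.
__device__ unsigned lane_from_tid() {
  unsigned tid = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;
  return tid % warpSize;  // warpSize is 32 on current NVIDIA hardware
}
```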
Thanks, @illuhad.
We should definitely revisit this if benchmarks suggest that manually computing the lane id is always better. I found some references to special registers potentially having high latency, but nothing about the relative cost of different special registers. Since we'd need to read the
* sycl:
  - [SYCL] Fix test failure due to size discrepancy in 32 and 64 bit environment (intel#2594)
  - [BuildBot] Linux GPU driver uplift to 20.39.17972 (intel#2600)
  - [SYCL][NFC] Remove cyclic dependency in headers (intel#2601)
  - [SYCL][Doc] Fix Sphinx docs index (intel#2599)
  - [SYCL][PI][L0] Add an assert to piEnqueueKernelLaunch() when the GlobalWorkOffset!=0 (intel#2593)
  - [SYCL][Driver] Turn on -spirv-allow-extra-diexpressions option (intel#2578)
  - [SYCL] Serialize queue related PI API in the Level-Zero plugin (intel#2588)
  - Added missing math APIs for devicelib. (intel#2558)
  - [SYCL][DOC] Fix path to FPGA device selector (intel#2563)
  - [SYCL][CUDA] Add basic sub-group functionality (intel#2587)
  - [SYCL] Add specialization constant feature design doc. (intel#2572)
  - [SYCL] Expand release lit test to CPU and ACC (intel#2589)
  - [SYCL] Add clang support for FPGA kernel attribute scheduler_target_fmax_mhz (intel#2511)
  - [SYCL][ESIMD] Fixed compiler crash in LowerESIMDVecArg pass (intel#2556)
The patch adds a representation of the CacheControls(Load/Store)INTEL decorations as metadata placed on the memory-accessing instruction, with the following form:

!spirv.DecorationCacheControlINTEL !X
!X = !{i32 %decoration_kind%, i32 %level%, i32 %control%, i32 %operand of the instruction to decorate%}

The patch also creates a dummy GEP accessing the pointer operand of the instruction, to perform an SSA copy of the pointer, and attaches the !spirv.Decorations metadata to that GEP.

A few notes about this implementation and other options. It is sub-optimal, as it requires iterating over all instructions in the module. Alternatives are:
1. In SPIRVWriter, during transDecorations, rewrite the already translated instruction (say, a load) with a new one that takes the GEP as its operand. While the translator provides an API for erasing instructions and rewriting SPIRVBasicBlocks, it is unfortunately not really usable outside of the SPIRVModule context.
2. In SPIRVWriter, during transValueWithoutDecoration, add special handling for instructions carrying spirv.DecorationCacheControlINTEL metadata. This would work with one exception: it would also require exposing some private methods of SPIRVScavenger to create a GEP of the correct type, which I'm not sure is a good idea.

Note that with this implementation, in the worst case SPIRVScavenger would create a bitcast later during translation.

Signed-off-by: Sidorov, Dmitry <[email protected]>
Original commit: KhronosGroup/SPIRV-LLVM-Translator@c8bfc33ef8f9f27
Implements:
- Sub-group local id
- Sub-group id
- Number of sub-groups
- Sub-group size
- Max sub-group size

The implementations are functionally correct, but may benefit from additional optimization.

Signed-off-by: John Pennycook [email protected]
The implementation is different from the one proposed in https://intel.github.io/llvm-docs/cuda/opencl-subgroup-vs-cuda-crosslane-op.html, because I don't think `sreg.warpid` and `sreg.nwarpid` have the correct semantics for sub-groups. The mapping from work-items to sub-groups is invariant during a kernel's execution, which isn't true of the warp ID in PTX. As far as I can tell, the number of warp IDs represents the maximum number of warps that can execute in a CTA, rather than the number of warps in a CTA.

NVIDIA's PTX documentation says that `tid` should be used to compute a "virtual warp ID" if one is required, which is what I've implemented. I convinced myself that we have to compute the sub-group IDs and sizes from the linear size of the work-group, and couldn't find a simpler way to express this.

Ideally, we wouldn't have to re-compute each of these values on every call. It would be sufficient to compute them once at the start of the kernel and then re-use them, but I don't have enough knowledge of Clang/LLVM/libclc to implement that.
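A minimal CUDA sketch of the virtual-warp computation described above; the helper names are illustrative, not the actual libclc entry points, and warpSize stands in for the maximum sub-group size.

```cuda
// Linearize the thread index and the work-group size, then derive the
// sub-group quantities from those two values.
__device__ unsigned linear_id() {
  return threadIdx.x
       + threadIdx.y * blockDim.x
       + threadIdx.z * blockDim.x * blockDim.y;
}

__device__ unsigned linear_size() {
  return blockDim.x * blockDim.y * blockDim.z;
}

__device__ unsigned sub_group_local_id() { return linear_id() % warpSize; }
__device__ unsigned sub_group_id()       { return linear_id() / warpSize; }

__device__ unsigned num_sub_groups() {
  return (linear_size() + warpSize - 1) / warpSize;  // round up
}

__device__ unsigned max_sub_group_size() { return warpSize; }

__device__ unsigned sub_group_size() {
  // The last sub-group is smaller when the work-group size is not a
  // multiple of the warp size.
  unsigned rem = linear_size() % warpSize;
  return (rem != 0 && sub_group_id() == num_sub_groups() - 1) ? rem : warpSize;
}
```

As the description notes, these values are invariant for the lifetime of the kernel, so ideally they would be computed once at kernel start and reused rather than recomputed on every call.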