Constant memory optimization for CUDA backend #4278

Open
joeatodd opened this issue Aug 6, 2021 · 2 comments
Labels
cuda CUDA back-end enhancement New feature or request performance Performance related issues

Comments

@joeatodd
Contributor

joeatodd commented Aug 6, 2021

Is your feature request related to a problem? Please describe

In CUDA, a kernel functor can be copied to a CUDA symbol marked `__constant__`. Because the compiler then knows the kernel arguments cannot change during execution, it can perform optimisations such as loop-invariant code motion (LICM) and common subexpression elimination (CSE). DPC++ does not currently support this feature: kernel arguments are always passed in global device memory, so significant optimisation opportunities are lost.
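For reference, the underlying CUDA mechanism looks roughly like this (a sketch; the symbol name, functor type, and kernel are illustrative, not part of any existing API):

```cuda
// Hypothetical functor whose bytes are staged into constant memory.
struct Axpy {
  float a;
  const float* x;
  float* y;
  __device__ void operator()(int i) const { y[i] = a * x[i] + y[i]; }
};

// A __constant__ symbol sized to hold the flattened functor.
__constant__ unsigned char functor_buf[sizeof(Axpy)];

__global__ void driver(int n) {
  // Reads of functor_buf compile to constant-space loads, which the
  // optimiser can treat as invariant (enabling LICM, CSE, etc.).
  const Axpy& f = *reinterpret_cast<const Axpy*>(functor_buf);
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) f(i);
}

// Host side: copy the functor's bytes to the symbol, then launch:
//   Axpy h{/* ... */};
//   cudaMemcpyToSymbol(functor_buf, &h, sizeof(Axpy));
//   driver<<<blocks, threads>>>(n);
```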

An example of the use of CUDA Symbols from Kokkos:

    // Copy functor asynchronously from staging to constant memory on the device
    cudaMemcpyToSymbolAsync(kokkos_impl_cuda_constant_memory_buffer, staging,
                            sizeof(DriverType), 0, cudaMemcpyHostToDevice,
                            cudaStream_t(cuda_instance->m_stream));

    // Invoke the driver function on the device
    (base_t::
         get_kernel_func())<<<grid, block, shmem, cuda_instance->m_stream>>>();

We have observed up to 20% improvement in kernel performance from Kokkos' CUDA backend when using symbols, and expect similar improvements in Kokkos' SYCL backend.

Describe the solution you would like

It would be useful for a SYCL user to be able to mark specific kernel functions to have their arguments loaded into a CUDA symbol in constant memory. This feature could be enabled for the SYCL CUDA backend via changes to the Clang driver & SYCL Runtime, and the addition of a new kernel attribute.

Kernel Class

A new attribute on the kernel class (e.g. sycl::kernel_constant_mem) would inform the Clang driver that the given kernel should read parameters from a CUDA symbol in the constant memory space as opposed to standard kernel parameters in the global memory space.
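Usage might look something like the following (entirely hypothetical: the attribute name `sycl::kernel_constant_mem` is a placeholder from this proposal, and its exact spelling and placement would be settled during design):

```cpp
// Hypothetical: sycl::kernel_constant_mem does not exist yet.
class [[sycl::kernel_constant_mem]] MyKernel;

q.submit([&](sycl::handler& cgh) {
  cgh.parallel_for<MyKernel>(range, [=](sycl::id<1> i) {
    // The captures (a, x, y) would be read from the constant-memory
    // symbol rather than from standard kernel parameters.
    y[i] = a * x[i] + y[i];
  });
});
```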

Driver

The driver should take kernel functions marked sycl::kernel_constant_mem and replace ld.param instructions with ld.const instructions that point to offsets into a single constant-memory symbol.
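Illustratively, the rewrite at the PTX level would look like this (hypothetical PTX; the symbol name and offsets depend on the layout the pass chooses):

```
// Before: argument read from the kernel parameter space
ld.param.u64  %rd1, [kernel_param_0];

// After: the same value read from an offset into the shared constant symbol
ld.const.u64  %rd1, [sycl_constant_symbol+0];
```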

LLVM Pass

The LLVM pass which performs this conversion should occur as early as possible. Specifically, it should occur before optimisation passes which depend on the const-ness of the kernel parameters.

The driver is also responsible for allocating this single constant device symbol with a stable name which is known to the SYCL runtime.

An additional flag to the compiler (e.g. -fsycl-use-constant-symbols) informs the compiler whether to perform this LLVM pass & symbol allocation.

SYCL Runtime

If the -fsycl-use-constant-symbols flag was enabled, and the SYCL Runtime launches a kernel marked with sycl::kernel_constant_mem, it should first copy the flattened functor to the constant CUDA symbol allocated by the driver. The runtime need not perform any modifications to the kernel, nor does it need to store duplicate 'constant' & 'non-constant' versions, because this is handled by the compiler.

Notes

The proposed solution involves a single constant memory allocation which consumes the entirety of the device's constant memory. When the runtime handles multiple sycl::kernel_constant_mem kernels, the constant symbol is reused.

@joeatodd joeatodd added the enhancement New feature or request label Aug 6, 2021
@bader bader added the cuda CUDA back-end label Aug 6, 2021
@joeatodd joeatodd changed the title CUDA Symbols Constant memory optimization for CUDA backend Aug 6, 2021
@keryell
Contributor

keryell commented Aug 6, 2021

That sounds like a useful feature.
It would be preferable to use a compile-time kernel property à la #4280 or a decorating function, instead of adding yet another attribute.

@AerialMantis AerialMantis added the performance Performance related issues label Aug 18, 2021
@jchlanda
Contributor

jchlanda commented Jan 6, 2023

A PR implementing CUDA native spec constants: #7946, which exploits the same approach (copying specialization constants to a CUDA symbol with a memcpy).
