Constant memory optimization for CUDA backend #4278

Open
joeatodd opened this issue Aug 6, 2021 · 2 comments
Labels
cuda CUDA back-end enhancement New feature or request performance Performance related issues

Comments

@joeatodd
Contributor

joeatodd commented Aug 6, 2021

Is your feature request related to a problem? Please describe

In CUDA, a kernel functor can be copied to a CUDA symbol marked `__constant__`. Because the compiler then knows the kernel arguments cannot change during execution, it can perform optimisations such as loop-invariant code motion (LICM) and common subexpression elimination (CSE). DPC++ does not currently support this feature: kernel arguments are always passed in global device memory, so significant optimisation opportunities are lost.
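For reference, the underlying CUDA mechanism looks roughly like this (a sketch; the symbol name, functor type, and kernel are illustrative, not part of any existing API):

```cuda
// Hypothetical functor whose bytes are staged into constant memory.
struct Axpy {
  float a;
  const float* x;
  float* y;
  __device__ void operator()(int i) const { y[i] = a * x[i] + y[i]; }
};

// A __constant__ symbol sized to hold the flattened functor.
__constant__ unsigned char functor_buf[sizeof(Axpy)];

__global__ void driver(int n) {
  // Reads of functor_buf compile to constant-space loads, which the
  // optimiser can treat as invariant (enabling LICM, CSE, etc.).
  const Axpy& f = *reinterpret_cast<const Axpy*>(functor_buf);
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) f(i);
}

// Host side: copy the functor's bytes to the symbol, then launch:
//   Axpy h{/* ... */};
//   cudaMemcpyToSymbol(functor_buf, &h, sizeof(Axpy));
//   driver<<<blocks, threads>>>(n);
```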

An example of the use of CUDA Symbols from Kokkos:

    // Copy functor asynchronously from staging to constant memory on the device
    cudaMemcpyToSymbolAsync(kokkos_impl_cuda_constant_memory_buffer, staging,
                            sizeof(DriverType), 0, cudaMemcpyHostToDevice,
                            cudaStream_t(cuda_instance->m_stream));

    // Invoke the driver function on the device
    (base_t::
         get_kernel_func())<<<grid, block, shmem, cuda_instance->m_stream>>>();

We have observed up to 20% improvement in kernel performance from Kokkos' CUDA backend when using symbols, and expect similar improvements in Kokkos' SYCL backend.

Describe the solution you would like

It would be useful for a SYCL user to be able to mark specific kernel functions to have their arguments loaded into a CUDA symbol in constant memory. This feature could be enabled for the SYCL CUDA backend via changes to the Clang driver & SYCL Runtime, and the addition of a new kernel attribute.

Kernel Class

A new attribute on the kernel class (e.g. sycl::kernel_constant_mem) would inform the Clang driver that the given kernel should read parameters from a CUDA symbol in the constant memory space as opposed to standard kernel parameters in the global memory space.
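Usage might look something like the following (entirely hypothetical: the attribute name `sycl::kernel_constant_mem` is a placeholder from this proposal, and its exact spelling and placement would be settled during design):

```cpp
// Hypothetical: sycl::kernel_constant_mem does not exist yet.
class [[sycl::kernel_constant_mem]] MyKernel;

q.submit([&](sycl::handler& cgh) {
  cgh.parallel_for<MyKernel>(range, [=](sycl::id<1> i) {
    // The captures (a, x, y) would be read from the constant-memory
    // symbol rather than from standard kernel parameters.
    y[i] = a * x[i] + y[i];
  });
});
```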

Driver

The driver should take kernel functions marked sycl::kernel_constant_mem and replace ld.param instructions with ld.const instructions that point to offsets into a single constant-memory symbol.
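Illustratively, the rewrite at the PTX level would look like this (hypothetical PTX; the symbol name and offsets depend on the layout the pass chooses):

```
// Before: argument read from the kernel parameter space
ld.param.u64  %rd1, [kernel_param_0];

// After: the same value read from an offset into the shared constant symbol
ld.const.u64  %rd1, [sycl_constant_symbol+0];
```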

LLVM Pass

The LLVM pass which performs this conversion should occur as early as possible. Specifically, it should occur before optimisation passes which depend on the const-ness of the kernel parameters.

The driver is also responsible for allocating this single constant device symbol with a stable name which is known to the SYCL runtime.

An additional flag to the compiler (e.g. -fsycl-use-constant-symbols) informs the compiler whether to perform this LLVM pass & symbol allocation.

SYCL Runtime

If the -fsycl-use-constant-symbols flag was enabled, and the SYCL Runtime launches a kernel marked with sycl::kernel_constant_mem, it should first copy the flattened functor to the constant CUDA symbol allocated by the driver. The runtime need not perform any modifications to the kernel, nor does it need to store duplicate 'constant' & 'non-constant' versions, because this is handled by the compiler.

Notes

The proposed solution involves a single constant memory allocation which consumes the entirety of the device's constant memory. When the runtime handles multiple sycl::kernel_constant_mem kernels, the constant symbol is reused.

@joeatodd joeatodd added the enhancement New feature or request label Aug 6, 2021
@bader bader added the cuda CUDA back-end label Aug 6, 2021
@joeatodd joeatodd changed the title CUDA Symbols Constant memory optimization for CUDA backend Aug 6, 2021
@keryell
Contributor

keryell commented Aug 6, 2021

That sounds like a useful feature.
It would be preferable to use a compile-time kernel property à la #4280 or a decorating function, instead of adding yet another attribute.

@AerialMantis AerialMantis added the performance Performance related issues label Aug 18, 2021
@jchlanda
Contributor

jchlanda commented Jan 6, 2023

A PR implementing CUDA native spec constants: #7946, which exploits the same approach (copying specialization constants to a CUDA symbol with a memcpy).
