Support buffer location on CUDA #5827


Open
sherry-yuan opened this issue Mar 16, 2022 · 10 comments
Labels
cuda CUDA back-end enhancement New feature or request

Comments

@sherry-yuan
Contributor

sherry-yuan commented Mar 16, 2022

CUDA supports allocation in global memory and in cache; the ask is to let CUDA USM allocation reserve cache memory when a buffer_location property with a value of 4 is passed in.

The change will need to be made here:

pi_result cuda_piextUSMDeviceAlloc(void **result_ptr, pi_context context,
                                   pi_device device,
                                   pi_usm_mem_properties *properties,
                                   size_t size, pi_uint32 alignment) {
  assert(result_ptr != nullptr);
  assert(context != nullptr);
  assert(device != nullptr);
  assert(properties == nullptr);
  pi_result result = PI_SUCCESS;
  try {
    ScopedContext active(context);
    result = PI_CHECK_ERROR(cuMemAlloc((CUdeviceptr *)result_ptr, size));
  } catch (pi_result error) {
    result = error;
  }
  assert(alignment == 0 ||
         (result == PI_SUCCESS &&
          reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0));
  return result;
}

enum class target {
  device,
  host_task,
  global_buffer = device, // Deprecated
  constant_buffer, // Deprecated
  local, // Deprecated
  host_buffer // Deprecated
};
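For reference, a rough sketch (not a definitive implementation) of how the plugin could walk the zero-terminated key/value property list instead of asserting that it is null. The key name PI_MEM_ALLOC_BUFFER_LOCATION and the helper below are hypothetical and only stand in for whatever property ends up encoding the buffer location:

// Sketch only: PI_MEM_ALLOC_BUFFER_LOCATION is a hypothetical key name that
// stands in for whatever property encodes the requested buffer location.
static pi_uint64 get_buffer_location(pi_usm_mem_properties *properties) {
  pi_uint64 buffer_location = 0; // default: plain global memory
  if (properties) {
    // The property list is a zero-terminated sequence of (key, value) pairs.
    for (pi_usm_mem_properties *prop = properties; *prop != 0; prop += 2) {
      if (prop[0] == PI_MEM_ALLOC_BUFFER_LOCATION)
        buffer_location = prop[1];
    }
  }
  return buffer_location;
}
// cuda_piextUSMDeviceAlloc could then branch on
// get_buffer_location(properties) == 4 to pick the cache-backed allocation
// path instead of asserting properties == nullptr.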
@sherry-yuan sherry-yuan added the enhancement New feature or request label Mar 16, 2022
@sherry-yuan
Contributor Author

sherry-yuan commented Mar 18, 2022

Eventually, this will be supported through the full solution to the new USM API (spec here: #5656). Once that is done, properties such as properties{noalias, read_only} can be passed to annotated_ptr (spec here: #5755) so that the compiler can make additional optimizations in where the memory is allocated.

queue q;
int *t = malloc_device<int>(N, q, property p{t, properties{noalias, read_only}});
// The call above returns an annotated_ptr with property p{t, properties{noalias, read_only}}
q.single_task([=] {
     … = p[…]; // The compiler knows it is safe to cache p because it never changes
});

Thanks to @GarveyJoe for the suggested solution above!

@zjin-lcf
Contributor

In the code snippet, the compiler returns the read-only property only after it analyzes the kernel. Is that right?

@sherry-yuan
Contributor Author

sherry-yuan commented Mar 18, 2022

I believe it is the user who provides the property to the malloc API (I updated the example code a little so that it is clearer). The user-provided information (read_only & noalias) can be used by the compiler when deciding whether an optimization (e.g. allocating it in constant memory) is possible.

@zjin-lcf
Contributor

It is clear.
Reading CUDA programs shows that developers may allocate CUDA device or constant memory statically in the global scope. This feature is helpful for porting CUDA programs.

Thanks.

@HabKaffee HabKaffee self-assigned this Mar 21, 2022
@HabKaffee HabKaffee added the cuda CUDA back-end label Mar 21, 2022
@HabKaffee HabKaffee removed their assignment Mar 21, 2022
@JackAKirk
Contributor

JackAKirk commented Mar 6, 2023

It is clear. Reading CUDA programs shows that developers may allocate CUDA device or constant memory statically in the global scope. This feature is helpful for porting CUDA programs.

Thanks.

Let's be careful not to mix up statically allocated const memory and non-statically allocated read-only caching (texture cache in cuda).

__constant__ int i;

in CUDA runtime API maps to:

constexpr specialization_id<int> i;

in SYCL, and this will be enabled with this PR #7946
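For illustration, a minimal sketch of the SYCL 2020 specialization-constant pattern that plays the role of constant data (the names spec_i and run are just examples):

#include <sycl/sycl.hpp>
using namespace sycl;

constexpr specialization_id<int> spec_i; // analogue of __constant__ int i;

void run(queue &q, int value, int *out) { // out: a USM device pointer (assumed)
  q.submit([&](handler &cgh) {
    cgh.set_specialization_constant<spec_i>(value); // set on the host
    cgh.single_task([=](kernel_handler kh) {
      // Read the specialization constant inside the kernel.
      *out = kh.get_specialization_constant<spec_i>();
    });
  });
}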

Now at the moment statically allocating cuda global memory is not possible in DPC++, and I do not know that it maps to anything in the SYCL spec. I.e. there is no SYCL analogue to:

__device__ int i;

However, my understanding of the first message in this issue is that this was not the request. The request appears to be the ability to malloc memory that is guaranteed to use the texture memory cache in the CUDA case. This largely overlaps with what this extension is for: #7397. I am now concerned that #7397 and the "restrict" properties usage in https://github.com/tiwaria1/llvm/blob/36d521e0edef3fab4444e2964c24aa5f10879f63/sycl/doc/extensions/proposed/sycl_ext_oneapi_kernel_arg_properties.asciidoc are close to being duplicates.
The difference from my reading would be that using

annotated_ptr(ptr_a, properties{const, restrict});

in a kernel would require, in the CUDA backend, that we implicitly call the __ldg builtins and use the returned value whenever ptr_a is accessed throughout the kernel.
At least from the point of view of the CUDA backend, a possible downside of this approach is that it does not give the user flexibility over when to use the texture cache, which apparently can sometimes be desirable and was a feature request that we received: see the final comments here: https://gitlab.com/gromacs/gromacs/-/issues/3928
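For contrast, a rough sketch of the explicit route (assuming the ldg() free function that #7397 proposes, in the namespace used by that proposal), where the user decides per load whether to go through the texture cache:

#include <sycl/sycl.hpp>
using namespace sycl;
// Assumes the ldg() free function proposed by #7397.
namespace exp_cuda = ext::oneapi::experimental::cuda;

void add(queue &q, const float *a, const float *b, float *out, size_t n) {
  q.parallel_for(range<1>{n}, [=](id<1> i) {
    float va = exp_cuda::ldg(&a[i]); // a is read through the read-only (texture) cache
    out[i] = va + b[i];              // b is read normally; the user chooses per access
  });
}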

This brings up all kinds of questions:

  • What is the analogue of texture memory in Intel devices? Previously I got the impression that it didn't currently exist. Or perhaps it is quite different from how it is in Nvidia hardware?
  • If it exists, do you favour the approach described in this comment (Support buffer location on CUDA #5827 (comment)) of using it via an explicit request (as opposed to the other way, where it is inferred from const T* restrict), compared to the ldg approach in [SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension #7397?
  • Should we consider introducing an analogue of __device__ int i; for global memory in sycl?

cc @gmlueck

@gmlueck
Contributor

gmlueck commented Mar 6, 2023

Should we consider introducing an analogue of __device__ int i; for global memory in sycl?

Isn't this what device_global is for? See sycl_ext_oneapi_device_global.
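For illustration, a minimal sketch of that extension's usage (the names dev_counter and bump are just examples):

#include <sycl/sycl.hpp>
namespace exp = sycl::ext::oneapi::experimental;

// Namespace-scope device-side state: roughly the analogue of __device__ int i;
exp::device_global<int> dev_counter;

void bump(sycl::queue &q) {
  // Kernels refer to the device_global directly, like a CUDA __device__ variable.
  q.single_task([] { dev_counter.get() += 1; }).wait();
  // Per the proposal, queue::copy/memcpy overloads move the value to and from the host.
}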

Regarding the use of ldg vs. annotated_ptr and properties ... there is another option as well. I think we are working on an extension that provides a general syntax to annotate loads and stores with a property that provides a caching hint. Maybe it makes sense to eventually integrate "ldg" into that. Tagging @Pennycook, who (I think) has taken up this proposal recently after it was left dormant for a while.

@JackAKirk
Contributor

Should we consider introducing an analogue of __device__ int i; for global memory in sycl?

Isn't this what device_global is for? See sycl_ext_oneapi_device_global.

Thanks, I didn't know about that extension. cc @jchlanda

Regarding the use of ldg vs. annotated_ptr and properties ... there is another option as well. I think we are working on an extension that provides a general syntax to annotate loads and stores with a property that provides a caching hint. Maybe it makes sense to eventually integrate "ldg" into that. Tagging @Pennycook, who (I think) has taken up this proposal recently after it was left dormant for a while.

OK that might be a good idea. It would be good to learn a bit more about caching on Intel devices. Is there some good documentation on this anywhere?

@jle-quel
Contributor

When this issue/feature request was opened, the ask was for a "temporary" feature to enable a developer to allocate memory in constant memory for the CUDA backend.

Looking at the first comments made to this issue, this feature would be supported via the new USM API extension.

The extension mentioned by @sherry-yuan is now closed and has been handed over to @jessicadavies-intel.

Is there any ongoing implementation work for these extensions?

Moreover, as @JackAKirk and @gmlueck mentioned, the ongoing work is closely in line with this issue but does not solve it "directly".

The question here is: where do we stand with this issue? Is this feature ask still relevant for the CUDA backend alone, without the extensions (which is what I believe the issue was opened for)?

Is this being handled by the works mentioned in the latest comments?

@zjin-lcf
Contributor

zjin-lcf commented Apr 11, 2023

@jle-quel

When a feature to enable a developer to allocate memory in constant memory for the CUDA backend is available, please let us know. Thank you.

In CUDA, users often allocate memory in constant memory statically:

__device__ __constant__ int constant_data[1024];

@zjin-lcf
Contributor

zjin-lcf commented Apr 11, 2023

I list the results on a V100 GPU for the example: https://github.com/zjin-lcf/HeCBench/blob/master/cmembench-cuda/main.cu

cuda
./main 1000
PASS Average kernel execution time (memory access width = 4 bytes): 0.028944 ms
PASS Average kernel execution time (memory access width = 8 bytes): 0.028764 ms
PASS Average kernel execution time (memory access width = 16 bytes): 0.026231 ms

dpct
./main 1000
PASS Average kernel execution time (memory access width = 4 bytes): 1.781998 ms
PASS Average kernel execution time (memory access width = 8 bytes): 1.106395 ms
PASS Average kernel execution time (memory access width = 16 bytes): 1.127009 ms
