Support buffer location on CUDA #5827


Open
sherry-yuan opened this issue Mar 16, 2022 · 10 comments
Labels
cuda CUDA back-end enhancement New feature or request

Comments

@sherry-yuan
Contributor

sherry-yuan commented Mar 16, 2022

CUDA supports allocation in global memory and in cache; the ask is to let CUDA USM allocation reserve cache memory when a buffer_location property with a value of 4 is passed in.

The change will need to be made here:

pi_result cuda_piextUSMDeviceAlloc(void **result_ptr, pi_context context,
                                   pi_device device,
                                   pi_usm_mem_properties *properties,
                                   size_t size, pi_uint32 alignment) {
  assert(result_ptr != nullptr);
  assert(context != nullptr);
  assert(device != nullptr);
  assert(properties == nullptr);
  pi_result result = PI_SUCCESS;
  try {
    ScopedContext active(context);
    result = PI_CHECK_ERROR(cuMemAlloc((CUdeviceptr *)result_ptr, size));
  } catch (pi_result error) {
    result = error;
  }
  assert(alignment == 0 ||
         (result == PI_SUCCESS &&
          reinterpret_cast<std::uintptr_t>(*result_ptr) % alignment == 0));
  return result;
}

enum class target {
  device,
  host_task,
  global_buffer = device, // Deprecated
  constant_buffer, // Deprecated
  local, // Deprecated
  host_buffer // Deprecated
};
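For reference, a rough sketch (not a definitive implementation) of how the plugin could walk the zero-terminated key/value property list instead of asserting that it is null. The key name PI_MEM_ALLOC_BUFFER_LOCATION and the helper below are hypothetical and only stand in for whatever property ends up encoding the buffer location:

// Sketch only: PI_MEM_ALLOC_BUFFER_LOCATION is a hypothetical key name that
// stands in for whatever property encodes the requested buffer location.
static pi_uint64 get_buffer_location(pi_usm_mem_properties *properties) {
  pi_uint64 buffer_location = 0; // default: plain global memory
  if (properties) {
    // The property list is a zero-terminated sequence of (key, value) pairs.
    for (pi_usm_mem_properties *prop = properties; *prop != 0; prop += 2) {
      if (prop[0] == PI_MEM_ALLOC_BUFFER_LOCATION)
        buffer_location = prop[1];
    }
  }
  return buffer_location;
}
// cuda_piextUSMDeviceAlloc could then branch on
// get_buffer_location(properties) == 4 to pick the cache-backed allocation
// path instead of asserting properties == nullptr.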
@sherry-yuan sherry-yuan added the enhancement New feature or request label Mar 16, 2022
@sherry-yuan
Contributor Author

sherry-yuan commented Mar 18, 2022

Eventually, this will be supported through the full solution to the new USM API (spec here: #5656). Once that is done, properties such as properties{noalias, read_only} can be passed to annotated_ptr (spec here: #5755) so that the compiler can make additional optimizations in where the memory is allocated.

queue q;
int *t = malloc_device<int>(N, q, property p{t, properties{noalias, read_only}});
// The call above returns an annotated_ptr with property p{t, properties{noalias, read_only}}
q.single_task([=] {
     … = p[…]; // The compiler knows it is safe to cache p because it never changes
});

Thanks to @GarveyJoe for the suggested solution above!

@zjin-lcf
Contributor

In the code snippet, the compiler returns the read-only property only after it analyzes the kernel. Is that right?

@sherry-yuan
Contributor Author

sherry-yuan commented Mar 18, 2022

I believe it is the user who provides the property to the malloc API (I updated the example code a little so that it is clearer). The user-provided information (read_only & noalias) can be used by the compiler when deciding whether an optimization (e.g. allocating it in constant memory) is possible.

@zjin-lcf
Contributor

It is clear.
Reading CUDA programs shows that developers may allocate CUDA device or constant memory statically in the global scope. This feature is helpful for porting CUDA programs.

Thanks.

@HabKaffee HabKaffee self-assigned this Mar 21, 2022
@HabKaffee HabKaffee added the cuda CUDA back-end label Mar 21, 2022
@HabKaffee HabKaffee removed their assignment Mar 21, 2022
@JackAKirk
Contributor

JackAKirk commented Mar 6, 2023

It is clear. Reading CUDA programs shows that developers may allocate CUDA device or constant memory statically in the global scope. This feature is helpful for porting CUDA programs.

Thanks.

Let's be careful not to mix up statically allocated const memory and non-statically allocated read-only caching (texture cache in cuda).

__constant__ int i;

in CUDA runtime API maps to:

constexpr specialization_id<int> i;

in SYCL, and this will be enabled with this PR #7946
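For illustration, a minimal sketch of the SYCL 2020 specialization-constant pattern that plays the role of constant data (the names spec_i and run are just examples):

#include <sycl/sycl.hpp>
using namespace sycl;

constexpr specialization_id<int> spec_i; // analogue of __constant__ int i;

void run(queue &q, int value, int *out) { // out: a USM device pointer (assumed)
  q.submit([&](handler &cgh) {
    cgh.set_specialization_constant<spec_i>(value); // set on the host
    cgh.single_task([=](kernel_handler kh) {
      // Read the specialization constant inside the kernel.
      *out = kh.get_specialization_constant<spec_i>();
    });
  });
}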

Now at the moment statically allocating cuda global memory is not possible in DPC++, and I do not know that it maps to anything in the SYCL spec. I.e. there is no SYCL analogue to:

__device__ int i;

However, my understanding of the first message in this issue is that this was not the request. The request appears to be the ability to malloc memory that is guaranteed to use the texture memory cache in the CUDA case. This largely overlaps with what this extension is for: #7397. I am now concerned that #7397 and the "restrict" properties usage in https://github.com/tiwaria1/llvm/blob/36d521e0edef3fab4444e2964c24aa5f10879f63/sycl/doc/extensions/proposed/sycl_ext_oneapi_kernel_arg_properties.asciidoc are close to being duplicates.
The difference from my reading would be that using

annotated_ptr(ptr_a, properties{const, restrict});

in a kernel would require, in the CUDA backend, that we implicitly call the __ldg builtins and use the returned value whenever ptr_a is accessed throughout the kernel.
At least from the point of view of the CUDA backend, a possible downside of this approach is that it does not give the user flexibility over when to use the texture cache, which apparently can sometimes be desirable and was a feature request that we received: see the final comments here: https://gitlab.com/gromacs/gromacs/-/issues/3928
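For contrast, a rough sketch of the explicit route (assuming the ldg() free function that #7397 proposes, in the namespace used by that proposal), where the user decides per load whether to go through the texture cache:

#include <sycl/sycl.hpp>
using namespace sycl;
// Assumes the ldg() free function proposed by #7397.
namespace exp_cuda = ext::oneapi::experimental::cuda;

void add(queue &q, const float *a, const float *b, float *out, size_t n) {
  q.parallel_for(range<1>{n}, [=](id<1> i) {
    float va = exp_cuda::ldg(&a[i]); // a is read through the read-only (texture) cache
    out[i] = va + b[i];              // b is read normally; the user chooses per access
  });
}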

This brings up all kinds of questions:

  • What is the analogue of texture memory in Intel devices? Previously I got the impression that it didn't currently exist. Or perhaps it is quite different from how it is in Nvidia hardware?
  • If it exists, do you favour the approach described in this comment (Support buffer location on CUDA #5827 (comment)) of using it via an explicit request (as opposed to the other way, where it is inferred from const T* restrict), compared to the ldg approach in [SYCL][CUDA] Introduce sycl_ext_oneapi_cuda_tex_cache_read extension #7397?
  • Should we consider introducing an analogue of __device__ int i; for global memory in sycl?

cc @gmlueck

@gmlueck
Contributor

gmlueck commented Mar 6, 2023

Should we consider introducing an analogue of __device__ int i; for global memory in sycl?

Isn't this what device_global is for? See sycl_ext_oneapi_device_global.
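For illustration, a minimal sketch of that extension's usage (the names dev_counter and bump are just examples):

#include <sycl/sycl.hpp>
namespace exp = sycl::ext::oneapi::experimental;

// Namespace-scope device-side state: roughly the analogue of __device__ int i;
exp::device_global<int> dev_counter;

void bump(sycl::queue &q) {
  // Kernels refer to the device_global directly, like a CUDA __device__ variable.
  q.single_task([] { dev_counter.get() += 1; }).wait();
  // Per the proposal, queue::copy/memcpy overloads move the value to and from the host.
}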

Regarding the use of ldg vs. annotated_ptr and properties ... there is another option as well. I think we are working on an extension that provides a general syntax to annotate loads and stores with a property that provides a caching hint. Maybe it makes sense to eventually integrate "ldg" into that. Tagging @Pennycook, who (I think) has taken up this proposal recently after it was left dormant for a while.

@JackAKirk
Contributor

Should we consider introducing an analogue of __device__ int i; for global memory in sycl?

Isn't this what device_global is for? See sycl_ext_oneapi_device_global.

Thanks, I didn't know about that extension. cc @jchlanda

Regarding the use of ldg vs. annotated_ptr and properties ... there is another option as well. I think we are working on an extension that provides a general syntax to annotate loads and stores with a property that provides a caching hint. Maybe it makes sense to eventually integrate "ldg" into that. Tagging @Pennycook, who (I think) has taken up this proposal recently after it was left dormant for a while.

OK that might be a good idea. It would be good to learn a bit more about caching on Intel devices. Is there some good documentation on this anywhere?

@jle-quel
Contributor

When this issue/feature request was opened, the ask was for a "temporary" feature to enable a developer to allocate memory in constant memory for the CUDA backend.

Looking at the first comments made to this issue, this feature would be supported via the new USM API extension.

The extension mentioned by @sherry-yuan is now closed and has been handed over to @jessicadavies-intel.

Is there any ongoing implementation work for these extensions?

Moreover, as @JackAKirk and @gmlueck mentioned, the ongoing work is closely in line with this issue but does not solve it "directly".

The question here is: where do we stand with this issue? Is this feature ask still relevant for the CUDA backend alone, without the extensions (which is what I believe the issue was opened for)?

Is this being handled by the works mentioned in the latest comments?

@zjin-lcf
Contributor

zjin-lcf commented Apr 11, 2023

@jle-quel

When a feature to enable a developer to allocate memory in constant memory for the CUDA backend is available, please let us know. Thank you.

In CUDA, users often allocate memory in constant memory statically:

__device__ __constant__ int constant_data[1024];

@zjin-lcf
Contributor

zjin-lcf commented Apr 11, 2023

I list the results on a V100 GPU for the example: https://github.com/zjin-lcf/HeCBench/blob/master/cmembench-cuda/main.cu

cuda
./main 1000
PASS Average kernel execution time (memory access width = 4 bytes): 0.028944 ms
PASS Average kernel execution time (memory access width = 8 bytes): 0.028764 ms
PASS Average kernel execution time (memory access width = 16 bytes): 0.026231 ms

dpct
./main 1000
PASS Average kernel execution time (memory access width = 4 bytes): 1.781998 ms
PASS Average kernel execution time (memory access width = 8 bytes): 1.106395 ms
PASS Average kernel execution time (memory access width = 16 bytes): 1.127009 ms
