Skip to content

[SYCL][CUDA] Handle the case of not having any CUDA device #1212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 2, 2020

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Feb 28, 2020

If there are no CUDA devices, do not return any CUDA-based SYCL platform rather than aborting or throwing an exception.

Fixes #1198.

Ruyk
Ruyk previously approved these changes Feb 28, 2020
Copy link
Contributor

@Ruyk Ruyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks better than my patch, which is now closed.

@bader bader added the cuda CUDA back-end label Feb 28, 2020
@@ -34,6 +34,8 @@ pi_result map_error(CUresult result) {
return PI_INVALID_OPERATION;
case CUDA_ERROR_INVALID_CONTEXT:
return PI_INVALID_CONTEXT;
case CUDA_ERROR_NO_DEVICE:
Copy link
Contributor

@bjoernknafla bjoernknafla Feb 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error code seems to only come up with CUDA-OpenGL interop according to the docs - is that something you see "in the wild" otherwise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry - this is actually a leftover from the first attempt I made, that was returning PI_INVALID_PLATFORM if there are no CUDA devices.

It should be fine to drop these two lines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if PI_INVALID_PLATFORM is the best match for the device error as I am unsure which OpenCL (extension) call would relate to OpenGL in this case?

initFlag,
[](pi_result &err) {
CUresult result = cuInit(0);
if (result == CUDA_ERROR_NO_DEVICE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not seeing this error code in the 10.2.89 CUDA Driver API docs docs - does an earlier version return it?

Should we instead check as follows?

Suggested change
if (result == CUDA_ERROR_NO_DEVICE) {
if (result != CUDA_SUCCESS) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not in the docs, but it is returned when there are no available devices (I'm testing with CUDA 10.2).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to special case a working installation which simply has no devices (for example because of CUDA_VISIBLE_DEVICES=).

I guess it also makes sense not to return the platform, instead of callingabort(), if cuInit() fails for other reasons.

I can make the change (and remove the call to PI_CHECK_ERROR(result)) if this is the preferred behaviour.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that cuInit() should be allowed to fail for any reason. At that point we should just not get any CUDA devices at runtime. The application should still be allowed to sue OpenCL/host devices at that point.

So yeah, I would just check with CUDA_SUCCESS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can confirm CUDA_ERROR_NO_DEVICE is reported by cuInit, despite of it not being documented.
I think some of the other error causes are worth of throwing error, otherwise users will simply get "no cuda platform" without any other information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😕 Do you think production code should behave that way? If I want to run an application on "some platform", I mainly want it to run. If it can't use any CUDA devices, then fine, it should still use any other device that it can, and not just die on me.

So an exception is really not what I would go for here. For whatever reason the cuInit() function is dying. At the most I would print an "info message" if the return value of cuInit() is something really non-expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some of the other error causes are worth of throwing error, otherwise users will simply get "no cuda platform" without any other information.

I do not know how the plugins should work in the long term.
For the moment I understand that - if they are built - both the OpenCL and the CUDA plugins will be used, without a way to select/deselect them at runtime.

Now, the OpenCL plugin seems happy in not reporting any platforms for which it doesn't find any devices.

I think the CUDA plugin should never abort (or throw an exception that causes the runtime to do so); for example because that would break also SYCL applications that do not care about the PTX/CUDA backend.

Whether to print an error or silently report no CUDA devices (and thus no CUDA platform) is less clear, but I would suggest to use the same approach as the OpenCL plugin (i.e. be silent) and let the application or the user deal with the lack of devices.

If, on the other hand, one prefers a different behaviour (e.g. warning about a CUDA installation without device, or an NVIDIA device without CUDA drivers, etc.) I think the OpenCL backend should behave in the same way with respect to OpenCL devices.

@bjoernknafla
Copy link
Contributor

Should we check in cuda_piDevicesGet if there is a platform and return PI_INVALID_PLATFORM otherwise?

@bjoernknafla
Copy link
Contributor

Studying OpenCL 2.1 specs:

  • Without the cl_khr_icd extension (which we do not report/support) there is now way to not return a platform without an error code, but no error code exists to report that a platform does not exist...

To not deviate too much from the OpenCL spec, can we really support a 0 for number of platforms? Or do we always have to return a platform but then not report any devices?

@fwyzard
Copy link
Contributor Author

fwyzard commented Feb 28, 2020

Is it possible to call cuda_piDevicesGet if cuda_piPlatformsGet does not return any platforms ?

@bjoernknafla
Copy link
Contributor

No, though at the moment we always return a platform - even if we set the internal number to 0 - and I think we have to return a platform if we do not want to return an error. And I believe we do not want to return an error if there are no devices as this might trip up an implementation that is ok with finding any platform with devices even if CUDA isn't the one.

@fwyzard
Copy link
Contributor Author

fwyzard commented Feb 28, 2020

For what is worth, my interpretation is that clGetPlatformIDs returns (ignoring the cases about invalid inputs and out-of-memory conditions):


if the cl_khr_icd extension is enabled:

  • CL_SUCCESS if the function is executed successfully and there are a non zero number of platforms available.
  • CL_PLATFORM_NOT_FOUND_KHR if no platforms are found.

if the cl_khr_icd extension is not enabled:

  • CL_SUCCESS if the function is executed successfully.

I do not know if "the function is executed successfully" implies that at least one platform with at least one device should be available.

@bjoernknafla
Copy link
Contributor

Just checked the only part of SYCL that calls piPlatformsGet and it checks the number of returned platforms.

This should therefore be fine for the SYCL implementation but we have to check out unit tests.

@bjoernknafla
Copy link
Contributor

@fwyzard thank you for the back and forth to allow me to wrap my head around the intricacies!

bjoernknafla
bjoernknafla previously approved these changes Feb 28, 2020
@fwyzard
Copy link
Contributor Author

fwyzard commented Feb 28, 2020

My pleasure.

By the way, here is what I get on a machine with both OpenCL and CUDA available:

Available SYCL platforms:
  - Intel(R) OpenCL, driver version OpenCL 2.1 LINUX
    - Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, driver version 2020.9.1.0.18

  - Intel(R) OpenCL HD Graphics, driver version OpenCL 2.1 
    - Intel(R) Gen9 HD Graphics NEO, driver version 20.07.15711

  - NVIDIA CUDA, driver version CUDA 10.20
    - Tesla K40c, driver version CUDA 10.20

  - SYCL host platform, driver version 1.2
    - SYCL host device, driver version 1.2

and here is what I get with the proposed changes, if I "hide" the NVIDIA GPU via CUDA_VISIBLE_DEVICES=:

Available SYCL platforms:
  - Intel(R) OpenCL, driver version OpenCL 2.1 LINUX
    - Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz, driver version 2020.9.1.0.18

  - Intel(R) OpenCL HD Graphics, driver version OpenCL 2.1 
    - Intel(R) Gen9 HD Graphics NEO, driver version 20.07.15711

  - SYCL host platform, driver version 1.2
    - SYCL host device, driver version 1.2

which looks like the behaviour I would expect :-)

Copy link
Contributor

@bader bader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a couple of nits.

bader
bader previously approved these changes Mar 2, 2020
If CUDA initialisation fails or there are no CUDA devices, do not return
a CUDA-based SYCL platform rather than aborting or throwing an exception.

Signed-off-by: Andrea Bocci <[email protected]>
@fwyzard fwyzard dismissed stale reviews from bader and bjoernknafla via 16ba1c5 March 2, 2020 11:25
@bader bader merged commit 745e759 into intel:sycl Mar 2, 2020
alexbatashev pushed a commit to alexbatashev/llvm that referenced this pull request Mar 4, 2020
…ctor_tests

* origin/sycl: (32 commits)
  [SYCL] Fix circular reference between events and queues (intel#1226)
  [CI][Doc] Use SSH to deploy GitHub Pages (intel#1232)
  [SYCL][CUDA][Test] Testing for use of CUDA primary context (intel#1174)
  [SYCL] allow underscore symbol in temporary directory name
  [SYCL] Reject zero length arrays (intel#1153)
  [SYCL] Fix static code analyzis concerns (intel#1189)
  [SYCL] Add more details about the -fintelfpga option (intel#1218)
  [SYCL][CUDA] Select only NVPTX64 device binaries (intel#1223)
  [SYCL] Reverse max work-group size order (intel#1177)
  [SYCL][Doc] Add GroupAlgorithms extension (intel#1079)
  [SYCL] Fix SYCL internal enumerators conflict with user defined macro (intel#1188)
  [SYCL][CUDA] Fixes context release and unnamed context scope (intel#1207)
  [SYCL][CUDA] Fix context creation property parsing
  [CUDA][PI] clang-format pi.h
  [SYCL][CUDA] Handle the case of not having any CUDA device (intel#1212)
  [SYCL] Fix check-sycl-deploy target problems (intel#1165)
  [SYCL] Disable tests which take more than 5 minutes (intel#1220)
  [SYCL] Make context constructors explicit to avoid unintended conversions (intel#1219)
  [SYCL][NFC] Add clang-format configuration file for SYCL LIT tests (intel#1224)
  [SYCL] Fix command cleanup invoked from multiple threads (intel#1214)
  ...
@fwyzard fwyzard deleted the fix_issue_1198 branch March 11, 2020 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda CUDA back-end
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using cl::sycl::platform::get_platforms() on a machine without a CUDA or OpenCL device
6 participants