Querying current device is slow compared to CuPy #439
Comments
We'll have to cache the CC on a per-device basis, e.g.:

```python
data = {}  # per-device cache: device ordinal -> (major, minor)

def get_cc(dev):
    if dev in data:
        return data[dev]
    data[dev] = (
        driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev)[1],
        driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev)[1],
    )
    return data[dev]
```

```
In [33]: get_cc(1)
Out[33]: (12, 0)

In [36]: %timeit get_cc(1)
51.7 ns ± 0.0214 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [37]: %timeit cp.cuda.Device().compute_capability
179 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```

Caching is also what CuPy does internally.
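The same per-device caching can also be written with functools.lru_cache; a quick sketch (assuming `driver` is cuda.bindings.driver, as above):

```python
import functools

from cuda.bindings import driver


@functools.lru_cache(maxsize=None)
def get_cc(dev):
    # Only the first call per device ordinal hits the driver; subsequent calls
    # are served from lru_cache's internal dict.
    major = driver.cuDeviceGetAttribute(
        driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev)[1]
    minor = driver.cuDeviceGetAttribute(
        driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev)[1]
    return major, minor
```

At this nanosecond scale the decorator may add slightly more per-call overhead than the hand-rolled dict check, but it keeps the cache bookkeeping out of the function body.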
I did some refactoring and compared the get-device calls directly:

```
In [19]: %timeit runtime.cudaGetDevice()
338 ns ± 0.463 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [20]: %timeit driver.cuCtxGetDevice()
406 ns ± 1.79 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [21]: %timeit cp.cuda.runtime.getDevice()
112 ns ± 0.822 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
```

A simple get-device call through cuda.bindings is roughly 3x slower than CuPy's.
Accessing …

@rwgk reported that …
This line accounts for about 82% of the runtime; I figured that out by replacing that line with a hard-wired device ID.
Small update: I made a few trivial performance changes that helped quite a bit, then compared the performance with and without this diff, all else the same:

```diff
- err, device_id = runtime.cudaGetDevice()
- assert err == driver.CUresult.CUDA_SUCCESS
+ device_id = 0  # hard-wired
```

I.e. almost the entire remaining performance difference is due to the runtime.cudaGetDevice() call.
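For reference, the cost of that call can also be measured in isolation; a small sketch (assuming cuda.bindings and CuPy are installed, numbers will vary by machine):

```python
import timeit

import cupy as cp
from cuda.bindings import runtime

n = 1_000_000
# Just the get-device call, nothing else from Device().
t_bindings = timeit.timeit(runtime.cudaGetDevice, number=n)
t_cupy = timeit.timeit(cp.cuda.runtime.getDevice, number=n)
print(f"runtime.cudaGetDevice():     {t_bindings / n * 1e9:.1f} ns/call")
print(f"cp.cuda.runtime.getDevice(): {t_cupy / n * 1e9:.1f} ns/call")
```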
Yes, see #439 (comment). Right now the problem is in cuda.bindings, not cuda.core. I have changed the issue label to reflect this status.
Here's my investigation report. I benchmarked several variants of the get-device binding:

- Version 1 is a minimized version of the binding.
- Version 2 creates the CUdevice but skips returning the error code.
- Version 3 adds the error-code return back.

The problem is the creation of the CUresult instance: it's really, really slow. I measured one more variant as well.

As for the next step: I do see that returning the enum member directly gives better results, so I propose a version that does exactly that. This gives our most common and important case much better performance, while the perf drop in the error case is less significant.
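To make that concrete, here is a rough Python-level sketch of the fast-path pattern being described; `_raw_get_device` below is a hypothetical stand-in for the underlying C call, not an actual cuda.bindings API:

```python
from cuda.bindings import driver

# Pre-existing enum member; returning it avoids constructing a new CUresult per call.
_CUDA_SUCCESS = driver.CUresult.CUDA_SUCCESS


def _raw_get_device():
    # Hypothetical stand-in: imagine the C call returning plain ints (error code, device ordinal).
    return 0, 0


def get_device_fast():
    err, dev = _raw_get_device()
    if err == 0:
        # Hot path: hand back the cached CUDA_SUCCESS member.
        return _CUDA_SUCCESS, dev
    # Rare error path: constructing the enum here is fine.
    return driver.CUresult(err), dev
```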
Wow! Great findings Vlad! It is insane how slow constructing a CUresult is. I wonder if it makes sense to build an internal cache ourselves? The built-in dict lookup is very fast (>10x). Something like:

```cython
_m = dict((int(v), v) for k, v in driver.CUresult.__members__.items())

def cuCtxGetDevice():
    cdef CUdevice device = CUdevice()
    err = int(cydriver.cuCtxGetDevice(<cydriver.CUdevice*>device._pvt_ptr))
    return (_m[err], device)
```

This is reasonably fast from what I see:

```
In [51]: m = dict((int(v), v) for k, v in driver.CUresult.__members__.items())

In [52]: %timeit m[100]
20.5 ns ± 0.0777 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [53]: %timeit driver.CUresult(100)  # for comparison
254 ns ± 1.06 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

Alternatively, we can drop the built-in …
(Your fast path is also reasonable FWIW, I just wonder if this is worth our effort.)
My take from comparing version 1 and version 2 is that we wasted 100% overhead (60 → 120 ns) just to create a tuple... We may want to think seriously about breaking the API in the next major release.
Here's one more version I measured: output arguments also add overhead.
I read it wrong. Creating the return tuple is reasonable (~10 ns).
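For a rough sense of scale, the per-call Python object costs being discussed here (the CUdevice wrapper and the return tuple) can be ballparked directly; a sketch assuming cuda.bindings is installed, with numbers that will vary by machine:

```python
import timeit

from cuda.bindings import driver

n = 1_000_000
x = 0
# Constructing the CUdevice wrapper object that each call returns or fills.
t_cudevice = timeit.timeit(driver.CUdevice, number=n)
# Building a small (non-constant) return tuple, for comparison.
t_tuple = timeit.timeit(lambda: (x, x), number=n)
print(f"CUdevice():      {t_cudevice / n * 1e9:.1f} ns")
print(f"2-element tuple: {t_tuple / n * 1e9:.1f} ns")
```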
Getting the current device using cuda.core is quite a bit slower than CuPy. Ultimately, my goal is to get the compute capability of the current device, and this is even slower.

Are there tricks (e.g., caching) CuPy is employing here that cuda.core can use as well? Alternately, is there another way for me to use cuda.core or cuda.bindings to get this information quickly? Note that for my use case, I'm not super concerned about the first call to Device(), but I do want subsequent calls to be trivially inexpensive if the current device hasn't changed.

Using the low-level cuda.bindings is also not quite as fast.
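A minimal sketch of this kind of comparison, assuming CuPy and cuda.core's experimental Device are available:

```python
import timeit

import cupy as cp
from cuda.core.experimental import Device

n = 100_000
# Query the current device and read its compute capability through cuda.core ...
t_core = timeit.timeit(lambda: Device().compute_capability, number=n)
# ... versus CuPy's device object, which caches this internally.
t_cupy = timeit.timeit(lambda: cp.cuda.Device().compute_capability, number=n)
print(f"cuda.core: {t_core / n * 1e9:.0f} ns/call")
print(f"CuPy:      {t_cupy / n * 1e9:.0f} ns/call")
```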