fast error code #546
Conversation
/ok to test
@@ -1162,6 +1163,8 @@ class cudaError_t(IntEnum):
 cudaErrorUnknown = cyruntime.cudaError.cudaErrorUnknown{{endif}}
 {{if 'cudaErrorApiFailureBase' in found_values}}
 cudaErrorApiFailureBase = cyruntime.cudaError.cudaErrorApiFailureBase{{endif}}
+
+_dict_cudaError_t = dict(((int(v), v) for k, v in cudaError_t.__members__.items()))
If we want / need to squeeze out some extra juice... When we end up using these dicts, we're always feeding cudaError_t results from cyruntime calls into the __getitem__ calls here, which means I think we're unnecessarily boxing those into Python integers where we could directly translate the cudaError_t to the returned Python object.
I think we could technically go as far as using a Cython unordered_map[cudaError_t, object] here instead of a dict.
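For context, a rough, hypothetical sketch of the boxing being described; the stand-in enum and the names _c_call and wrapper are made up for illustration and are not the generated bindings:

# boxing_sketch.pyx -- hypothetical illustration only, not the generated code
from enum import IntEnum

class cudaError_t(IntEnum):  # tiny stand-in for the generated enum
    cudaSuccess = 0
    cudaErrorInvalidValue = 1

_dict_cudaError_t = dict((int(v), v) for v in cudaError_t.__members__.values())

cdef int _c_call() nogil:
    return 0  # stand-in for a cyruntime call returning the raw C error code

def wrapper():
    cdef int err = _c_call()
    # Using the C int as a dict key boxes it into a Python int first; the
    # suggestion above is to map the C value to the enum member directly.
    return _dict_cudaError_t[err]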
I'm having trouble piping this suggestion through to check the performance. Using CUresult as an example, the closest I got was:
cdef unordered_map[cydriver.CUresult, object] _dict_CUresult = dict(((int(v), v) for k, v in CUresult.__members__.items()))
which fails with the error: Python object type 'Python object' cannot be used as a template argument.
Overall I think the performance of this PR is sufficient to close out the issue.
I think to hold a Python object in a C++ container we need to do unsafe casting:
# distutils: language = c++
from libcpp.unordered_map cimport unordered_map
cimport cpython
from cuda.bindings import driver

cdef unordered_map[int, cpython.PyObject*] m
for v in driver.CUresult.__members__.values():
    # Borrowed PyObject* pointing at the enum member; the members are kept
    # alive by the CUresult class itself, so the pointer stays valid.
    m[v] = <cpython.PyObject*><object>(v)

def get_m(result):
    return <object>m[result]
Although in this case it should be alright, it is a bit nerve-wracking. Also, it is not faster:
In [1]: import test_more
In [2]: %timeit test_more.get_m(0)
31.4 ns ± 0.0583 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [3]: %timeit test_more.get_m(100)
30.6 ns ± 0.0612 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [5]: from cuda.bindings import driver, runtime
In [6]: m = dict(((int(v), v) for _, v in driver.CUresult.__members__.items()))
In [7]: %timeit m[0]
20.5 ns ± 0.0179 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [8]: %timeit m[100]
20.5 ns ± 0.0533 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
FWIW, skipping the attribute access saves us ~3 ns:
In [1]: from test_more import get_m
In [2]: %timeit get_m(0)
28.7 ns ± 0.0316 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
The only other idea is that, because we're operating on fixed-size enums, instead of hash-table lookups via dicts we could create a fixed-size array and use the cudaError_t value to index into it.
I don't think we need to do this unless we find ourselves in a position of really needing more performance, though.
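A rough sketch of that array idea, purely hypothetical and not part of this PR; it assumes cudaError_t is importable from cuda.bindings.runtime, and note that the enum also contains large sentinel values (e.g. cudaErrorApiFailureBase), so the table here is sized for the largest member:

# table_sketch.pyx -- hypothetical fixed-size lookup table instead of a dict
from cuda.bindings.runtime import cudaError_t  # assumption: enum location

cdef list _table_cudaError_t = [None] * (max(cudaError_t.__members__.values()) + 1)
for _member in cudaError_t.__members__.values():
    _table_cudaError_t[int(_member)] = _member

def lookup(int err):
    # Index directly by the raw error code; gaps between enum values stay None.
    return _table_cudaError_t[err]

Whether this actually beats the dict would need the same %timeit comparison as above.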
f3e9d58 to fdc36e1 (Compare)
/ok to test
I confirm that both the driver and runtime APIs now have the same performance as without this PR (and the static linking PR), xref: #439 (comment):
In [1]: from cuda.bindings import driver, runtime
In [3]: %timeit runtime.cudaGetDevice()
130 ns ± 0.606 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [4]: %timeit driver.cuCtxGetDevice()
135 ns ± 0.796 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [5]: import cupy as cp
In [6]: %timeit cp.cuda.runtime.getDevice()
114 ns ± 0.558 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
@@ -85,6 +85,8 @@ class nvrtcResult(IntEnum):
 NVRTC_ERROR_PCH_CREATE = cynvrtc.nvrtcResult.NVRTC_ERROR_PCH_CREATE{{endif}}
 {{if 'NVRTC_ERROR_CANCELLED' in found_values}}
 NVRTC_ERROR_CANCELLED = cynvrtc.nvrtcResult.NVRTC_ERROR_CANCELLED{{endif}}
+
+_dict_nvrtcResult = dict(((int(v), v) for k, v in nvrtcResult.__members__.items()))
Two nits:
- micro-optimization: use .values() (saves about 0.7 µs when creating the dict for CUresult)
- declare it as cdef so that it is not accessible from Python

Suggested change:
- _dict_nvrtcResult = dict(((int(v), v) for k, v in nvrtcResult.__members__.items()))
+ cdef _dict_nvrtcResult = dict(((int(v), v) for v in nvrtcResult.__members__.values()))
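To illustrate the second nit, a minimal sketch (hypothetical module, not this PR's code) of why a module-level cdef name stays hidden from Python:

# visibility_sketch.pyx -- hypothetical, only to show the cdef visibility point
_py_dict = {0: "reachable"}   # regular module-level name: importable from Python
cdef _c_dict = {0: "hidden"}  # cdef variable: C-level only, not a module attribute

From Python, visibility_sketch._py_dict resolves normally, while visibility_sketch._c_dict raises AttributeError.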
This can be addressed in a separate PR (or just ignore it).
close #439