fast error code #546


Merged — 1 commit merged into NVIDIA:main on Apr 5, 2025

Conversation

vzhurba01 (Collaborator):

close #439

@vzhurba01 vzhurba01 added enhancement Any code-related improvements P1 Medium priority - Should do cuda.bindings Everything related to the cuda.bindings module labels Apr 3, 2025
@vzhurba01 vzhurba01 added this to the cuda-python 12.9.0 & 11.8.7 milestone Apr 3, 2025
@vzhurba01 vzhurba01 self-assigned this Apr 3, 2025
copy-pr-bot (bot) commented Apr 3, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

vzhurba01 (Collaborator, Author):

/ok to test

vzhurba01 (Collaborator, Author):

Commit f3e9d58 demonstrates changes unique to this PR. This PR is dirty since it contains auto-generated code from #544.

@@ -1162,6 +1163,8 @@ class cudaError_t(IntEnum):
cudaErrorUnknown = cyruntime.cudaError.cudaErrorUnknown{{endif}}
{{if 'cudaErrorApiFailureBase' in found_values}}
cudaErrorApiFailureBase = cyruntime.cudaError.cudaErrorApiFailureBase{{endif}}

_dict_cudaError_t = dict(((int(v), v) for k, v in cudaError_t.__members__.items()))
Collaborator:

If we want or need to squeeze out some extra juice: when we use these dicts we're always feeding cudaError_t results from cyruntime calls into __getitem__, which means I think we're unnecessarily boxing those values into Python integers when we could translate the cudaError_t directly to the returned Python object.

I think we could technically go as far as replacing the dict with a Cython unordered_map[cudaError_t, object].

vzhurba01 (Collaborator, Author):

I'm having trouble piping this suggestion through to check the performance. Using CUresult as an example, the closest I got was:

cdef unordered_map[cydriver.CUresult, object] _dict_CUresult = dict(((int(v), v) for k, v in CUresult.__members__.items()))

which fails with the error: Python object type 'Python object' cannot be used as a template argument.

Overall I think the performance of this PR is sufficient to close out the issue.

Member:

I think to hold a Python object in a C++ container we need to do unsafe casting:

# distutils: language = c++

from libcpp.unordered_map cimport unordered_map
cimport cpython

from cuda.bindings import driver


cdef unordered_map[int, cpython.PyObject*] m
for v in driver.CUresult.__members__.values():
    m[v] = <cpython.PyObject*><object>(v)


def get_m(result):
    return <object>m[result]

Although in this case it should be alright, it is a bit nerve-wracking. It is also not faster:

In [1]: import test_more

In [2]: %timeit test_more.get_m(0)
31.4 ns ± 0.0583 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [3]: %timeit test_more.get_m(100)
30.6 ns ± 0.0612 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: from cuda.bindings import driver, runtime

In [6]: m = dict(((int(v), v) for _, v in driver.CUresult.__members__.items()))

In [7]: %timeit m[0]
20.5 ns ± 0.0179 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [8]: %timeit m[100]
20.5 ns ± 0.0533 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Member:

FWIW, skipping attribute access saves us ~3 ns:

In [1]: from test_more import get_m

In [2]: %timeit get_m(0)
28.7 ns ± 0.0316 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Collaborator:

The only other idea is that, because we're operating on fixed-size enums, instead of hash-table lookups via dicts we could create a fixed-size array and use the cudaError_t value to index into it.

I don't think we need to do this unless we find ourselves really needing more performance, though.

@vzhurba01 vzhurba01 changed the title 439 fast error code fast error code Apr 4, 2025
@vzhurba01 vzhurba01 force-pushed the 439-fast-error-code branch from f3e9d58 to fdc36e1 Compare April 4, 2025 20:06
vzhurba01 (Collaborator, Author):

/ok to test

leofang (Member) left a comment:

I confirm that both the driver and runtime APIs now have the same performance as measured without this PR (and the static linking PR); xref: #439 (comment):

In [1]: from cuda.bindings import driver, runtime

In [3]: %timeit runtime.cudaGetDevice()
130 ns ± 0.606 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [4]: %timeit driver.cuCtxGetDevice()
135 ns ± 0.796 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: import cupy as cp

In [6]: %timeit cp.cuda.runtime.getDevice()
114 ns ± 0.558 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

@@ -85,6 +85,8 @@ class nvrtcResult(IntEnum):
NVRTC_ERROR_PCH_CREATE = cynvrtc.nvrtcResult.NVRTC_ERROR_PCH_CREATE{{endif}}
{{if 'NVRTC_ERROR_CANCELLED' in found_values}}
NVRTC_ERROR_CANCELLED = cynvrtc.nvrtcResult.NVRTC_ERROR_CANCELLED{{endif}}

_dict_nvrtcResult = dict(((int(v), v) for k, v in nvrtcResult.__members__.items()))
Member:

Two nits:

  1. A micro-optimization: using .values() saves about 0.7 µs when building the dict for CUresult.
  2. Declare it as cdef so that it is not accessible from Python.

Suggested change:
_dict_nvrtcResult = dict(((int(v), v) for k, v in nvrtcResult.__members__.items()))
cdef _dict_nvrtcResult = dict(((int(v), v) for v in nvrtcResult.__members__.values()))
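To see that the two generator forms build the identical dict, a small sketch with a hypothetical stand-in enum (the real generated code iterates the full nvrtcResult membership):

```python
from enum import IntEnum

# Hypothetical stand-in for nvrtcResult.
class nvrtcResult(IntEnum):
    NVRTC_SUCCESS = 0
    NVRTC_ERROR_CANCELLED = 11

# Original form iterates items() but discards the key k ...
d_items = dict((int(v), v) for k, v in nvrtcResult.__members__.items())
# ... so iterating values() builds the same dict with one less unpack per member.
d_values = dict((int(v), v) for v in nvrtcResult.__members__.values())

assert d_items == d_values
```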

Member:

This can be addressed in a separate PR (or just ignore it).

@leofang leofang merged commit 1256bc1 into NVIDIA:main Apr 5, 2025
73 checks passed
github-actions (bot) commented Apr 5, 2025:

Doc Preview CI
Preview removed because the pull request was closed or merged.

Successfully merging this pull request may close these issues.

Querying current device is slow compared to CuPy