fast error code #546


Merged — 1 commit merged into NVIDIA:main on Apr 5, 2025

Conversation

vzhurba01 (Collaborator):

close #439

@vzhurba01 vzhurba01 added enhancement Any code-related improvements P1 Medium priority - Should do cuda.bindings Everything related to the cuda.bindings module labels Apr 3, 2025
@vzhurba01 vzhurba01 added this to the cuda-python 12.9.0 & 11.8.7 milestone Apr 3, 2025
@vzhurba01 vzhurba01 self-assigned this Apr 3, 2025
copy-pr-bot (bot) commented Apr 3, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

vzhurba01 (Collaborator, Author):

/ok to test

vzhurba01 (Collaborator, Author):

Commit f3e9d58 demonstrates changes unique to this PR. This PR is dirty since it contains auto-generated code from #544.

@@ -1162,6 +1163,8 @@ class cudaError_t(IntEnum):
cudaErrorUnknown = cyruntime.cudaError.cudaErrorUnknown{{endif}}
{{if 'cudaErrorApiFailureBase' in found_values}}
cudaErrorApiFailureBase = cyruntime.cudaError.cudaErrorApiFailureBase{{endif}}

_dict_cudaError_t = dict(((int(v), v) for k, v in cudaError_t.__members__.items()))
Collaborator:

If we want or need to squeeze out some extra juice: when we use these dicts we're always feeding cudaError_t results from cyruntime calls into __getitem__, which means I think we're unnecessarily boxing those values into Python integers when we could translate the cudaError_t directly to the returned Python object.

I think we could technically go as far as replacing the dict with a Cython unordered_map[cudaError_t, object].

vzhurba01 (Collaborator, Author):

I'm having trouble piping this suggestion through to check the performance. Using CUresult as an example, the closest I got was:

cdef unordered_map[cydriver.CUresult, object] _dict_CUresult = dict(((int(v), v) for k, v in CUresult.__members__.items()))

which fails with the error: Python object type 'Python object' cannot be used as a template argument.

Overall I think the performance of this PR is sufficient to close out the issue.

Member:

I think to hold a Python object in a C++ container we need to do unsafe casting:

# distutils: language = c++

from libcpp.unordered_map cimport unordered_map
cimport cpython

from cuda.bindings import driver


cdef unordered_map[int, cpython.PyObject*] m
for v in driver.CUresult.__members__.values():
    m[v] = <cpython.PyObject*><object>(v)


def get_m(result):
    return <object>m[result]

Although in this case it should be alright, it is a bit nerve-wracking. It is also not faster:

In [1]: import test_more

In [2]: %timeit test_more.get_m(0)
31.4 ns ± 0.0583 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [3]: %timeit test_more.get_m(100)
30.6 ns ± 0.0612 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: from cuda.bindings import driver, runtime

In [6]: m = dict(((int(v), v) for _, v in driver.CUresult.__members__.items()))

In [7]: %timeit m[0]
20.5 ns ± 0.0179 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [8]: %timeit m[100]
20.5 ns ± 0.0533 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Member:

FWIW, skipping attribute access saves us ~3 ns:

In [1]: from test_more import get_m

In [2]: %timeit get_m(0)
28.7 ns ± 0.0316 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Collaborator:

The only other idea is that, because we're operating on fixed-size enums, instead of hash-table lookups via dicts we could create a fixed-size array and use the cudaError_t value to index into it.

I don't think we need to do this unless we find ourselves really needing more performance, though.

@vzhurba01 vzhurba01 changed the title 439 fast error code fast error code Apr 4, 2025
@vzhurba01 vzhurba01 force-pushed the 439-fast-error-code branch from f3e9d58 to fdc36e1 Compare April 4, 2025 20:06
vzhurba01 (Collaborator, Author):

/ok to test

leofang (Member) left a comment:

I confirm that both the driver and runtime APIs now have the same performance as measured without this PR (and the static linking PR); xref: #439 (comment):

In [1]: from cuda.bindings import driver, runtime

In [3]: %timeit runtime.cudaGetDevice()
130 ns ± 0.606 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [4]: %timeit driver.cuCtxGetDevice()
135 ns ± 0.796 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [5]: import cupy as cp

In [6]: %timeit cp.cuda.runtime.getDevice()
114 ns ± 0.558 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

@@ -85,6 +85,8 @@ class nvrtcResult(IntEnum):
NVRTC_ERROR_PCH_CREATE = cynvrtc.nvrtcResult.NVRTC_ERROR_PCH_CREATE{{endif}}
{{if 'NVRTC_ERROR_CANCELLED' in found_values}}
NVRTC_ERROR_CANCELLED = cynvrtc.nvrtcResult.NVRTC_ERROR_CANCELLED{{endif}}

_dict_nvrtcResult = dict(((int(v), v) for k, v in nvrtcResult.__members__.items()))
Member:

Two nits:

  1. A micro-optimization: using .values() saves about 0.7 µs when building the dict for CUresult.
  2. Declare it as cdef so that it is not accessible from Python.

Suggested change:
_dict_nvrtcResult = dict(((int(v), v) for k, v in nvrtcResult.__members__.items()))
cdef _dict_nvrtcResult = dict(((int(v), v) for v in nvrtcResult.__members__.values()))
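To see that the two generator forms build the identical dict, a small sketch with a hypothetical stand-in enum (the real generated code iterates the full nvrtcResult membership):

```python
from enum import IntEnum

# Hypothetical stand-in for nvrtcResult.
class nvrtcResult(IntEnum):
    NVRTC_SUCCESS = 0
    NVRTC_ERROR_CANCELLED = 11

# Original form iterates items() but discards the key k ...
d_items = dict((int(v), v) for k, v in nvrtcResult.__members__.items())
# ... so iterating values() builds the same dict with one less unpack per member.
d_values = dict((int(v), v) for v in nvrtcResult.__members__.values())

assert d_items == d_values
```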

Member:

This can be addressed in a separate PR (or just ignore it).

@leofang leofang merged commit 1256bc1 into NVIDIA:main Apr 5, 2025
73 checks passed
github-actions (bot) commented Apr 5, 2025:

Doc Preview CI
Preview removed because the pull request was closed or merged.

Successfully merging this pull request may close these issues.

Querying current device is slow compared to CuPy