Skip to content

Add @functools.lru_cache decorator for get_binding_version() #512

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 12, 2025

Conversation

rwgk
Copy link
Collaborator

@rwgk rwgk commented Mar 12, 2025

This one-line change results in a 8k+ fold speedup.

>>> 35.381859/0.004149
8527.804049168473
$ git stash
$ python test_slowness.py 100000
driver.cuDriverGetVersion() 12060
cuda_utils.get_binding_version() (12, 8)
driver.cuDriverGetVersion()
    0.023946 seconds for 100000 iterations
    0.24 µs per call
cuda_utils.get_binding_version()
    35.381859 seconds for 100000 iterations
    353.82 µs per call
$ git stash pop
$ python test_slowness.py 100000
driver.cuDriverGetVersion() 12060
cuda_utils.get_binding_version() (12, 8)
driver.cuDriverGetVersion()
    0.022644 seconds for 100000 iterations
    0.23 µs per call
cuda_utils.get_binding_version()
    0.004149 seconds for 100000 iterations
    0.04 µs per call

In retrospect, I should have just looked at the get_bindings() implementation immediately.

The way I actually found this (perf version 6.8.12):

perf record -F 99 -g -- python test_slowness.py 100000
perf report

I gave the top of the perf report and the get_bindings() implementation to ChatGPT:

That made it immediately obvious that importlib.metadata.version("cuda-bindings") is the bottleneck, mainly because it involves regex calls, but also because it triggers filesystem calls.


import sys
import time

from cuda.bindings import driver
from cuda.core.experimental._utils import cuda_utils


class show_timings:
    def __init__(self, num_iters, label):
        self.label = label
        self.num_iters = num_iters

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        elapsed = time.perf_counter() - self.start
        print(self.label)
        if self.num_iters:
            print(f"    {elapsed:.6f} seconds for {self.num_iters} iterations")
            print(f"    {(elapsed / self.num_iters) * 1e6:.2f} µs per call")
        else:
            print(f"    {elapsed:.6f} seconds")


err, dv = driver.cuDriverGetVersion()
assert err == driver.CUresult.CUDA_SUCCESS
print("driver.cuDriverGetVersion()", dv, flush=True)

bv = cuda_utils.get_binding_version()
print("cuda_utils.get_binding_version()", bv, flush=True)

num_iters = int(sys.argv[1])

if 1:
    with show_timings(num_iters, "driver.cuDriverGetVersion()"):
        for _ in range(num_iters):
            driver.cuDriverGetVersion()
if 1:
    with show_timings(num_iters, "cuda_utils.get_binding_version()"):
        for _ in range(num_iters):
            cuda_utils.get_binding_version()

>>> 35.381859/0.004149
8527.804049168473

$ git stash
$ python test_slowness.py 100000
driver.cuDriverGetVersion() 12060
cuda_utils.get_binding_version() (12, 8)
driver.cuDriverGetVersion()
    0.023946 seconds for 100000 iterations
    0.24 µs per call
cuda_utils.get_binding_version()
    35.381859 seconds for 100000 iterations
    353.82 µs per call

$ git stash pop
$ python test_slowness.py 100000
driver.cuDriverGetVersion() 12060
cuda_utils.get_binding_version() (12, 8)
driver.cuDriverGetVersion()
    0.022644 seconds for 100000 iterations
    0.23 µs per call
cuda_utils.get_binding_version()
    0.004149 seconds for 100000 iterations
    0.04 µs per call
Copy link
Contributor

copy-pr-bot bot commented Mar 12, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk rwgk self-assigned this Mar 12, 2025
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 12, 2025

/ok to test

@rwgk rwgk requested a review from ksimpson-work March 12, 2025 22:10
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 12, 2025

@shwina for visibility

This PR does not fix #439

This comment has been minimized.

@leofang leofang added this to the cuda.core beta 3 milestone Mar 12, 2025
@leofang leofang added enhancement Any code-related improvements P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Mar 12, 2025
@rwgk rwgk merged commit f903d98 into NVIDIA:main Mar 12, 2025
76 checks passed
@leofang
Copy link
Member

leofang commented Mar 12, 2025

Thanks, Ralf! How did you notice the slowness?

Copy link

Doc Preview CI
Preview removed because the pull request was closed or merged.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 13, 2025

Thanks, Ralf! How did you notice the slowness?

When I was working on this:

c789bf6

Originally I had the _utils.get_binding_version() call inside the loop where you see if _BINDING_VERSION >= (12, 0):.

Today I was hoping fixing that very obvious problem first would help with #439 as well. But no, that's something different, and not nearly as extreme (5x vs 8500x).

@rwgk rwgk deleted the get_binding_version_8k branch March 13, 2025 03:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P0 High priority - Must do!
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants