Skip to content

Conversation

@rwgk
Copy link
Collaborator

@rwgk rwgk commented Mar 12, 2025

This one-line change results in a 8k+ fold speedup.

>>> 35.381859/0.004149
8527.804049168473
$ git stash
$ python test_slowness.py 100000
driver.cuDriverGetVersion() 12060
cuda_utils.get_binding_version() (12, 8)
driver.cuDriverGetVersion()
    0.023946 seconds for 100000 iterations
    0.24 µs per call
cuda_utils.get_binding_version()
    35.381859 seconds for 100000 iterations
    353.82 µs per call
$ git stash pop
$ python test_slowness.py 100000
driver.cuDriverGetVersion() 12060
cuda_utils.get_binding_version() (12, 8)
driver.cuDriverGetVersion()
    0.022644 seconds for 100000 iterations
    0.23 µs per call
cuda_utils.get_binding_version()
    0.004149 seconds for 100000 iterations
    0.04 µs per call

In retrospect, I should have just looked at the get_bindings() implementation immediately.

The way I actually found this (perf version 6.8.12):

perf record -F 99 -g -- python test_slowness.py 100000
perf report

I gave the top of the perf report and the get_bindings() implementation to ChatGPT:

That made it immediately obvious that importlib.metadata.version("cuda-bindings") is the bottleneck, mainly because it involves regex calls, but also because it triggers filesystem calls.


import sys
import time

from cuda.bindings import driver
from cuda.core.experimental._utils import cuda_utils


class show_timings:
    def __init__(self, num_iters, label):
        self.label = label
        self.num_iters = num_iters

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        elapsed = time.perf_counter() - self.start
        print(self.label)
        if self.num_iters:
            print(f"    {elapsed:.6f} seconds for {self.num_iters} iterations")
            print(f"    {(elapsed / self.num_iters) * 1e6:.2f} µs per call")
        else:
            print(f"    {elapsed:.6f} seconds")


err, dv = driver.cuDriverGetVersion()
assert err == driver.CUresult.CUDA_SUCCESS
print("driver.cuDriverGetVersion()", dv, flush=True)

bv = cuda_utils.get_binding_version()
print("cuda_utils.get_binding_version()", bv, flush=True)

num_iters = int(sys.argv[1])

if 1:
    with show_timings(num_iters, "driver.cuDriverGetVersion()"):
        for _ in range(num_iters):
            driver.cuDriverGetVersion()
if 1:
    with show_timings(num_iters, "cuda_utils.get_binding_version()"):
        for _ in range(num_iters):
            cuda_utils.get_binding_version()

>>> 35.381859/0.004149
8527.804049168473

$ git stash
$ python test_slowness.py 100000
driver.cuDriverGetVersion() 12060
cuda_utils.get_binding_version() (12, 8)
driver.cuDriverGetVersion()
    0.023946 seconds for 100000 iterations
    0.24 µs per call
cuda_utils.get_binding_version()
    35.381859 seconds for 100000 iterations
    353.82 µs per call

$ git stash pop
$ python test_slowness.py 100000
driver.cuDriverGetVersion() 12060
cuda_utils.get_binding_version() (12, 8)
driver.cuDriverGetVersion()
    0.022644 seconds for 100000 iterations
    0.23 µs per call
cuda_utils.get_binding_version()
    0.004149 seconds for 100000 iterations
    0.04 µs per call
@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Mar 12, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk rwgk self-assigned this Mar 12, 2025
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 12, 2025

/ok to test

@rwgk rwgk requested a review from ksimpson-work March 12, 2025 22:10
@rwgk
Copy link
Collaborator Author

rwgk commented Mar 12, 2025

@shwina for visibility

This PR does not fix #439

@github-actions

This comment has been minimized.

@leofang leofang added this to the cuda.core beta 3 milestone Mar 12, 2025
@leofang leofang added enhancement Any code-related improvements P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Mar 12, 2025
@rwgk rwgk merged commit f903d98 into NVIDIA:main Mar 12, 2025
76 checks passed
@leofang
Copy link
Member

leofang commented Mar 12, 2025

Thanks, Ralf! How did you notice the slowness?

@github-actions
Copy link

Doc Preview CI
Preview removed because the pull request was closed or merged.

@rwgk
Copy link
Collaborator Author

rwgk commented Mar 13, 2025

Thanks, Ralf! How did you notice the slowness?

When I was working on this:

c789bf6

Originally I had the _utils.get_binding_version() call inside the loop where you see if _BINDING_VERSION >= (12, 0):.

Today I was hoping fixing that very obvious problem first would help with #439 as well. But no, that's something different, and not nearly as extreme (5x vs 8500x).

@rwgk rwgk deleted the get_binding_version_8k branch March 13, 2025 03:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants