Querying current device is slow compared to CuPy #439

Getting the current device with `cuda.core` takes about 795 ns, more than 10x the 69 ns CuPy needs:

```python
In [1]: import cupy as cp

In [2]: %timeit cp.cuda.Device()
69 ns ± 0.496 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [3]: from cuda.core.experimental import Device

In [4]: %timeit Device()
795 ns ± 0.273 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

Ultimately, my goal is to get the compute capability of the current device, where the gap widens to roughly 30x (2.64 μs vs. 89.1 ns):

```python
In [5]: %timeit cp.cuda.Device().compute_capability
89.1 ns ± 0.413 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]: %timeit Device().compute_capability
2.64 μs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```

Are there tricks (e.g., caching) that CuPy employs here and that `cuda.core` could use as well? Alternatively, is there another way for me to use `cuda.core` or `cuda.bindings` to get this information quickly? Note that for my use case, I'm not super concerned about the _first_ call to `Device()`, but I do want _subsequent_ calls to be trivially inexpensive if the current device hasn't changed.
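
To make the kind of caching I have in mind concrete, here is a minimal sketch. It is an assumption about one possible approach, not how CuPy or `cuda.core` actually work: memoize the compute capability per device ordinal, so each call only re-reads the current device. `ComputeCapabilityCache` and both callables are hypothetical names, and the stand-in functions replace real CUDA queries so the sketch runs without a GPU.

```python
from typing import Callable, Dict, Tuple


class ComputeCapabilityCache:
    """Memoize compute capability per device ordinal (hypothetical helper)."""

    def __init__(
        self,
        get_current_device: Callable[[], int],
        query_cc: Callable[[int], Tuple[int, int]],
    ) -> None:
        self._get_current_device = get_current_device
        self._query_cc = query_cc
        self._cache: Dict[int, Tuple[int, int]] = {}

    def __call__(self) -> Tuple[int, int]:
        # The current device can change between calls, so it is always re-read.
        dev = self._get_current_device()
        try:
            return self._cache[dev]
        except KeyError:
            # A device's compute capability is fixed, so query it only once.
            cc = self._cache[dev] = self._query_cc(dev)
            return cc


# Stand-ins for real CUDA queries (assumptions, so the sketch runs anywhere).
current_device = 0
query_log = []


def fake_get_current_device() -> int:
    return current_device


def fake_query_cc(dev: int) -> Tuple[int, int]:
    query_log.append(dev)  # record each uncached query
    return {0: (7, 5), 1: (8, 0)}[dev]


get_cc_cached = ComputeCapabilityCache(fake_get_current_device, fake_query_cc)
first = get_cc_cached()
second = get_cc_cached()  # served from the cache; no second query
```

With real bindings, the two callables would wrap the actual current-device and compute-capability queries. The per-call cost would then be the current-device query plus one dictionary lookup, so the current-device query itself still needs to be fast.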

---

Using the low-level `cuda.bindings` directly is faster than `cuda.core`, but still well short of CuPy:

```python
In [10]: from cuda.bindings import driver, runtime

In [11]: def get_cc():
    ...:     dev = runtime.cudaGetDevice()[1]
    ...:     return driver.cuDeviceComputeCapability(dev)
    ...:

In [12]: get_cc()
Out[12]: (<CUresult.CUDA_SUCCESS: 0>, 7, 5)

In [13]: %timeit get_cc()
597 ns ± 0.494 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```
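
Note that `get_cc()` returns the raw `(status, major, minor)` tuple, as the output above shows. A small unwrapping helper could check the status and hand back just the payload. The sketch below is illustrative only: `unwrap` is a hypothetical name, and `FakeResult` is a stand-in for `driver.CUresult` so that it runs without CUDA installed.

```python
from enum import IntEnum
from typing import Any, Tuple


class FakeResult(IntEnum):
    """Stand-in for driver.CUresult (assumption; only two codes modeled)."""

    CUDA_SUCCESS = 0
    CUDA_ERROR_INVALID_DEVICE = 101


def unwrap(result: Tuple[Any, ...]) -> Any:
    """Raise on a non-zero status, else return the payload (hypothetical helper)."""
    status, *values = result
    if int(status) != 0:
        raise RuntimeError(f"CUDA call failed: {status!r}")
    # A single payload value is returned bare; otherwise a tuple is returned.
    return values[0] if len(values) == 1 else tuple(values)


# Same shape as the Out[12] tuple above, built from the stand-in enum.
major, minor = unwrap((FakeResult.CUDA_SUCCESS, 7, 5))
```

The extra Python call adds its own overhead, so this is about tidiness at the call site, not speed.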

