EPIC: Path finder for CUDA components #451

Open

leofang opened this issue Feb 14, 2025 · 14 comments
Labels: cuda.bindings (Everything related to the cuda.bindings module), EPIC (Soul of a release), feature (New feature or request), P1 (Medium priority - Should do)

Comments

@leofang (Member) commented Feb 14, 2025

In 2025, there are many ways to install CUDA into a Python environment. One key challenge is that the header/library search logic implemented in existing CUDA-enabled libraries (ex: #447) needs to be modernized, taking into account that CUDA can now be installed on a per-component basis (ex: I just want NVRTC and CCCL and nothing else). As a consequence, any prior art that checks whether a certain piece exists (ex: nvcc, cuda.h, nvvm, ...) and then assumes the whole Toolkit exists at known relative paths is no longer valid. Even Linux system package managers may not always behave as expected. (Though setting CUDA_HOME/CUDA_PATH as a fallback might still be OK.)

The CUDA Python team is well-positioned to take on these pain points so that other Python libraries do not need to worry about packaging sources, layouts, and so on. It is our intention to support modern CUDA packages and deployment options in a JIT-compilation-friendly way. Concretely, we should be able to answer, on a per-component basis:

  • where are the component headers?
  • where are the component shared libraries?
  • ...

Something like this (API design TBD):

from cuda.core.utils import CUDALocater

locater = CUDALocater()
nvcc_incl = locater.nvcc.include  # returns a list of valid abs paths to the include directories, or None 
cccl_incl = locater.cccl.include  # returns a list of valid abs paths to the include directories, or None
nvrtc_lib = locater.nvrtc.lib     # returns a list of valid abs paths to the shared libraries, or None 
...

This needs to cover:

  • CUDA installed via various package managers (apt, yum, conda, pip, ...)
  • Headers and shared libraries as bare minimum
    • From the JIT compilation perspective, headers are effectively a kind of shared library (they must be located at run time, just like the libraries themselves)
  • Linux and Windows
  • Default system search paths, if possible
    • This includes the "legacy" CTK locations, such as /usr/local/cuda on Linux, as a fallback (a sketch of one possible search order follows this list)
  • All CTK components relevant to Python users, such as:
    • nvcc/nvvm
      • this includes libdevice.bc
    • nvrtc
    • nvjitlink
    • cublas
    • cusolver
    • curand
    • cufft
    • cusparse
    • ...
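
To make the fallback ordering concrete, here is a minimal sketch of one possible search order (the function name and the exact ordering are illustrative assumptions, not a committed design): explicit environment overrides first, then conda and pip layouts, then the legacy system path.

import os
import site
import sys
from pathlib import Path

def candidate_cuda_roots():
    """Yield plausible CUDA installation roots, most specific first.

    Purely illustrative: the real search order, per-component handling,
    and Windows specifics are part of the API design that is still TBD.
    """
    # 1. Explicit user override via the conventional environment variables.
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if os.environ.get(var):
            yield Path(os.environ[var])
    # 2. conda-installed CTK components live under the active prefix.
    if os.environ.get("CONDA_PREFIX"):
        yield Path(os.environ["CONDA_PREFIX"])
    # 3. pip wheels ship per-component payloads under site-packages/nvidia/.
    for sp in site.getsitepackages():
        nvidia_dir = Path(sp) / "nvidia"
        if nvidia_dir.is_dir():
            yield from sorted(p for p in nvidia_dir.iterdir() if p.is_dir())
    # 4. Legacy system-wide CTK location as a last resort (Linux only).
    if sys.platform.startswith("linux"):
        yield Path("/usr/local/cuda")

for root in candidate_cuda_roots():
    print(root)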

Once completed, this would also help us unify the treatment of loading shared libraries in cuda.bindings, which currently diverges between Linux and Windows (a unified-loading sketch follows this list):

  • Linux: hack RPATH and rely on dynamic loader (ld.so)
  • Windows: search possible DLL locations (site-packages, ...)
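
As an illustration of what the unified treatment could look like, here is a minimal sketch; load_shared_library is a hypothetical helper, and it assumes the path finder hands back an absolute path:

import ctypes
import os
import sys

def load_shared_library(abs_path):
    """Load a shared library from an absolute path found by the path finder.

    Handing the loader a full path sidesteps both the Linux RPATH hack and
    the manual Windows DLL-location search.
    """
    if sys.platform == "win32":
        # Allow dependent DLLs sitting next to the target to be resolved.
        os.add_dll_directory(os.path.dirname(abs_path))
    return ctypes.CDLL(abs_path)
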
@github-actions bot added the triage (Needs the team's attention) label Feb 14, 2025
@leofang (Member, Author) commented Feb 14, 2025

cc @rwgk

@leofang added the feature, cuda.bindings, cuda.core, and EPIC labels Feb 14, 2025
@leofang changed the title from "Path finder for CUDA components" to "EPIC: Path finder for CUDA components" Feb 14, 2025
@leofang (Member, Author) commented Feb 18, 2025

I expect our path finder will be enough for these projects to drop the following code

@leofang added the P1 (Medium priority - Should do) label and removed the triage label Feb 18, 2025
@leofang (Member, Author) commented Feb 18, 2025

Another question we need to answer: in which module (cuda.bindings or cuda.core) should the path finder live? It seems like a high-level Pythonic helper suited to cuda.core, but cuda.bindings would need the same information for loading modules if we pull this off. I don't have an answer.

@leofang (Member, Author) commented Feb 18, 2025

(discussed offline; tentatively slating this for beta 3, with the understanding that we might not make it)

@leofang added this to the cuda.core beta 3 milestone Feb 18, 2025
@leofang (Member, Author) commented Feb 19, 2025

cc @cryos for visibility (since you're also working on wheels)

@NVIDIA deleted a comment from rwgk Feb 26, 2025
@leofang (Member, Author) commented Feb 26, 2025

@leofang (Member, Author) commented Mar 12, 2025

> As a consequence, any prior art that checks whether a certain piece exists (ex: nvcc, cuda.h, nvvm, ...) and then assumes the whole Toolkit exists at known relative paths is no longer valid. Even Linux system package managers may not always behave as expected.

Keith gave an expanded explanation of what's described in the epic body: #441 (comment).

@rwgk (Collaborator) commented Mar 13, 2025

Tracking a related numba.cuda PR, for easy reference: NVIDIA/numba-cuda#155

@rwgk (Collaborator) commented Mar 20, 2025

@leofang @kkraus14

  • I expanded my experiment under #447 to move the entire numba/cuda/cuda_paths.py (not just the part that locates libnvvm) into cuda-bindings. It turns out to be very easy.

  • I illustrated the approach here.

It seems very straightforward to me. It'd be great to discuss.

@rwgk (Collaborator) commented Apr 15, 2025

As of 2025-04-15 (808074d):

These .so files exist under /usr/local/cuda-12.8/ (Linux x86_64 CTK 12.8.1) but are not supported by cuda.bindings.path_finder:

/usr/local/cuda-12.8/version.json
   "cuda" : {
      "name" : "CUDA SDK",
      "version" : "12.8.1"
   }
/usr/local/cuda-12.8/lib64/
    libaccinj64.so
    libcheckpoint.so
    libcuinj64.so
    libcupti.so
    libnvperf_host.so
    libnvperf_target.so
    libnvToolsExt.so
    libOpenCL.so
    libpcsamplingutil.so

These Windows .dll files are under https://developer.download.nvidia.com/compute/cuda/redist/ but are not supported by cuda.bindings.path_finder:

cuinj64_128.dll
cuinj64_126.dll
cuinj64_125.dll
cuinj64_124.dll
cuinj64_123.dll
cuinj64_122.dll
cuinj64_121.dll
cuinj64_120.dll
cuinj64_118.dll
cuinj64_117.dll
cuinj64_116.dll
cuinj64_115.dll
cuinj64_114.dll

@rwgk (Collaborator) commented Apr 16, 2025

I'm familiarizing myself with the content of https://developer.download.nvidia.com/compute/cuda/redist/ (to learn what .so and .dll files we have).

A small side product:

cuda/redist Matrix

Rows: components; columns: CTK versions (the per-version availability marks are omitted here).

Versions: 11.0.3, 11.1.1, 11.2.0, 11.2.1, 11.2.2, 11.3.0, 11.3.1, 11.4.0, 11.4.1, 11.4.2, 11.4.3, 11.4.4, 11.5.0, 11.5.1, 11.5.2, 11.6.0, 11.6.1, 11.6.2, 11.7.0, 11.7.1, 11.8.0, 12.0.0, 12.0.1, 12.1.0, 12.1.1, 12.2.0, 12.2.1, 12.2.2, 12.3.0, 12.3.1, 12.3.2, 12.4.0, 12.4.1, 12.5.0, 12.5.1, 12.6.0, 12.6.1, 12.6.2, 12.6.3, 12.8.0, 12.8.1

Components: cuda_cccl, cuda_compat, cuda_cudart, cuda_cuobjdump, cuda_cupti, cuda_cuxxfilt, cuda_demo_suite, cuda_documentation, cuda_gdb, cuda_memcheck, cuda_nsight, cuda_nvcc, cuda_nvdisasm, cuda_nvml_dev, cuda_nvprof, cuda_nvprune, cuda_nvrtc, cuda_nvtx, cuda_nvvp, cuda_opencl, cuda_profiler_api, cuda_sandbox_dev, cuda_sanitizer_api, driver_assistant, fabricmanager, imex, libcublas, libcudla, libcufft, libcufile, libcurand, libcusolver, libcusparse, libnpp, libnvfatbin, libnvidia_nscq, libnvjitlink, libnvjpeg, libnvsdm, libnvvm_samples, nsight_compute, nsight_nvtx, nsight_systems, nsight_vse, nvidia_driver, nvidia_fs, release_date, release_label, release_product, visual_studio_integration
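
For reference, a matrix like this can be derived from the redistrib_<version>.json manifests published under the same URL. A minimal sketch, assuming the manifest layout stays stable across releases (top-level keys name components alongside release metadata):

import json
import urllib.request

BASE = "https://developer.download.nvidia.com/compute/cuda/redist"

def redist_components(version):
    """Return the top-level keys of one cuda/redist manifest."""
    with urllib.request.urlopen(f"{BASE}/redistrib_{version}.json") as resp:
        manifest = json.load(resp)
    # Keys are component names (cuda_nvrtc, libcublas, ...) plus metadata
    # entries such as release_date / release_label / release_product.
    return sorted(manifest)

print(redist_components("12.8.1"))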

@rwgk (Collaborator) commented Apr 16, 2025

Visual overview of shared library dependencies (GraphViz)

These were generated with:

@leofang removed the cuda.core label Apr 18, 2025
@rwgk (Collaborator) commented Apr 23, 2025

Tracking a key insight for easy future reference:

I've verified that all CUDA libraries in version 12.8.1 (x86_64) have their SONAME set (see the tiny script below).

Assuming this is the case for all 12.x releases and future releases, we can reliably check whether a shared library is already loaded by using its known SONAME, e.g.:

import ctypes
import os

try:
    # RTLD_NOLOAD returns a handle only if the library is already loaded;
    # it must be combined with a binding mode (RTLD_LAZY or RTLD_NOW),
    # otherwise glibc rejects the dlopen() call as having an invalid mode.
    handle = ctypes.CDLL("libnvvm.so.4", mode=os.RTLD_NOLOAD | os.RTLD_LAZY)
    print("Library is already loaded.")
except OSError:
    print("Library is not loaded yet.")

According to ChatGPT, "this method is effective for standard system libraries and well-maintained third-party libraries that follow proper versioning practices."

Full ChatGPT chat (very long)

Script used to inspect SONAMEs under /usr/local/cuda:

find_sonames.sh:

#!/bin/bash
# For every shared object (regular file or symlink, so installed CTK trees
# are labeled correctly), print its path, FILE/SYMLINK, and the SONAME if set.
find . \( -type f -o -type l \) -name '*.so*' -print0 | while IFS= read -r -d '' f; do
  type=$(test -L "$f" && echo SYMLINK || echo FILE)
  soname=$(readelf -d "$f" 2>/dev/null | awk '/SONAME/ {gsub(/[][]/, "", $5); print $5; exit}')
  echo "$f $type ${soname:-SONAME_NOT_SET}"
done

@rwgk (Collaborator) commented Apr 23, 2025

Summary of .so files that do NOT have SONAME set, in these releases:

cuda_11.0.3_450.51.06_linux.run
cuda_11.1.1_455.32.00_linux.run
cuda_11.2.2_460.32.03_linux.run
cuda_11.3.1_465.19.01_linux.run
cuda_11.4.4_470.82.01_linux.run
cuda_11.5.1_495.29.05_linux.run
cuda_11.6.2_510.47.03_linux.run
cuda_11.7.1_515.65.01_linux.run
cuda_11.8.0_520.61.05_linux.run
cuda_12.0.1_525.85.12_linux.run
cuda_12.1.1_530.30.02_linux.run
cuda_12.2.2_535.104.05_linux.run
cuda_12.3.2_545.23.08_linux.run
cuda_12.4.1_550.54.15_linux.run
cuda_12.5.1_555.42.06_linux.run
cuda_12.6.2_560.35.03_linux.run
cuda_12.8.0_570.86.10_linux.run

The first number is the count across all releases:

$ cat soname_not_set_110_through_128.txt
     17 eclipse_1605.so
     21 libbradient.so
      4 libdmabuf-server.so
      4 libdrm-egl-server.so
     21 libfullscreen-shell-v1.so
     30 libGL.so.1.5.0
     21 libivi-shell.so
     19 libqcertonlybackend.so
     34 libqgif.so
     34 libqico.so
     34 libqjpeg.so
     25 libqoffscreen.so
     19 libqopensslbackend.so
     34 libqsvg.so
     34 libqtga.so
     34 libqtiff.so
     21 libqt-plugin-wayland-egl.so
     21 libqwayland-egl.so
     21 libqwayland-generic.so
      4 libqwayland-xcomposite-egl.so
      4 libqwayland-xcomposite-glx.so
     34 libqwbmp.so
     34 libqxcb-glx-integration.so
     34 libqxcb.so
     21 libshm-emulation-server.so
     21 libvulkan-server.so
     17 libwl-shell-plugin.so
      4 libwl-shell.so
      4 libxcomposite-egl.so
      4 libxcomposite-glx.so
     21 libxdg-shell.so
      4 libxdg-shell-v5.so
      4 libxdg-shell-v6.so
      5 _ncu_report.so
     10 _sqlite3.cpython-310-x86_64-linux-gnu.so
      4 _sqlite3.cpython-312-x86_64-linux-gnu.so

Commands used:

cd extracted
find_sonames.sh > ../all_SONAME.txt
grep 'FILE SONAME_NOT_SET' all_SONAME.txt | grep -v /cuda_documentation/ | rev | cut -d/ -f1 | rev | sed 's/ FILE SONAME_NOT_SET$//' | sort | uniq -c

NOTE: The extracted CTK directories have no symlinks (unlike "installed" CTK directories).
