CUDA 11.8's syevd solver can cause an illegal memory access error when called through torch.linalg.eigh #655
Comments
Hi! I think we can explore upgrading the PyTorch CUDA versions, given that this is blocking your submission. Since the prebuilt PyTorch wheels bundle their own CUDA runtime, it is in line with expectations that changing the CUDA version in the Docker container won't change anything. We've just pinned the local CUDA to that version for consistency between the CUDA versions JAX and PyTorch are using. It seems like it is possible to use local CUDA with PyTorch; see this discussion. Could you check whether the solution proposed in that discussion works? If we can install it via the Docker images so that it generalizes, we can probably upgrade. @msaroufim do you have any tips on installing PyTorch 2.1.0 with CUDA 12.1?
Thankfully this has a one-line fix: you need to pass in the index-url, and you can Ctrl+F the index URL to see which torch/CUDA combinations are available (see the sketch below).
This usually works fine for most non-ancient versions of PyTorch, but for older ones you can install the cudatoolkit using conda and then build PyTorch from source, which will pick up whatever cudatoolkit version you have installed. This can be slow on machines with a small number of CPU cores, though.
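For reference, a minimal sketch of the kind of command meant here, assuming the cu121 wheel index is the desired combination (the exact torch/CUDA pairing should be verified against the PyTorch download index first):

```bash
# Sketch: install PyTorch 2.1.0 wheels built against CUDA 12.1 by pointing pip
# at the matching per-CUDA index URL. The cu121 tag is an assumption here;
# check https://download.pytorch.org/whl/ for the combinations that actually exist.
pip3 install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu121
```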
Awesome, thank you @msaroufim! @hjmshi can you confirm whether upgrading with the above procedure resolves the issue with `torch.linalg.eigh`?
Thanks @priyakasimbeg and @msaroufim! This change works on our side. We essentially modify the
Just confirming that this is the change you both had in mind? 😄
Yes, almost.
Got it, makes sense! Thanks @priyakasimbeg!
Just updating this thread: we're currently testing the workloads with the new CUDA version and the new JAX and PyTorch installations.
Merged CUDA upgrade into dev: #659
We are preparing a PyTorch submission for AlgoPerf that relies on the `torch.linalg.eigh` operator, which calls the `linalg_eigh_cusolver_syevd` solver from cuSOLVER. While running this operator with PyTorch 2.1.0 + CUDA 11.8, we have observed that it can create an illegal memory access error in our AlgoPerf runs. This failure is not recoverable.

Description
We have observed previous issues with CUDA 11.8 where the `torch.linalg.eigh` operator can cause a CUDA illegal memory access error, which is unrecoverable; see, for example, pytorch/pytorch#105359 and pytorch/pytorch#94772 (comment).

We have now observed this problem arise in our AlgoPerf experiments for the OGBG model:

Notice that in this case, we are aiming to bypass the error (which is caught by our script), but subsequent CUDA kernels then also hit illegal memory accesses.
Consistent with the GitHub issues posted above, we have checked that the cuSOLVER version is `/usr/local/cuda/lib64/libcusolver.so.11.4.1.48`, which is the problematic solver.
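As a quick way to confirm which cuSOLVER the container ships (a generic check, assuming the standard /usr/local/cuda layout of the NVIDIA CUDA images):

```bash
# List the cuSOLVER shared libraries bundled with the local CUDA toolkit.
# The path assumes the default layout of the nvidia/cuda Docker images.
ls -l /usr/local/cuda/lib64/libcusolver.so*
```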
Steps to Reproduce

Follow the steps in pytorch/pytorch#105359 (comment).
Source or Possible Fix
If possible, we would suggest using CUDA 12.1.1 instead of CUDA 11.8 for the contest in order to avoid these instabilities.
Is there a way that we can change the `Dockerfile` to use PyTorch 2.1.0 with CUDA 12.1.1? (We have tried changing the first line in the `Dockerfile` to `docker.io/nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04`, but we still observe PyTorch 2.1 + CUDA 11.8 being used when calling `torch.__version__` and `torch.version.cuda`.)
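For completeness, one way to verify inside the container which PyTorch build and CUDA runtime are actually in use (a generic check, not taken from the original report):

```bash
# Print the installed PyTorch version and the CUDA version it was built against.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```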
cc @anana10c @mikerabbat @tsunghsienlee @yuchenhao @shintaro-iwasaki