CUDA 11.8's syevd solver can cause an illegal memory access error when called through torch.linalg.eigh #655


Closed
hjmshi opened this issue Feb 27, 2024 · 8 comments

Comments

@hjmshi

hjmshi commented Feb 27, 2024

We are preparing a PyTorch submission for AlgoPerf that relies on the torch.linalg.eigh operator, which calls the linalg_eigh_cusolver_syevd solver from cuSOLVER. While running this operator with PyTorch 2.1.0 + CUDA 11.8, we have observed that it can create an illegal memory access error in our AlgoPerf runs. This failure is not recoverable.

Description

We have observed previous issues with CUDA 11.8 where the torch.linalg.eigh operator can raise a CUDA illegal memory access error, which is unrecoverable; see, for example, pytorch/pytorch#105359 and pytorch/pytorch#94772 (comment).
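
For reference, PyTorch's own error message (visible in the log below) suggests torch.backends.cuda.preferred_linalg_library() as a way to route around the cuSOLVER syevd path. A minimal sketch of that workaround follows; the matrix size and contents are made up for illustration, and this sidesteps the failing code path rather than fixing the underlying CUDA 11.8 bug:

import torch

# Prefer MAGMA over cuSOLVER for linear algebra ops, as the error message
# suggests. Illustrative only; we have not adopted this for our submission.
torch.backends.cuda.preferred_linalg_library("magma")

# The failing call in our runs boils down to an eigendecomposition like this:
A = torch.randn(512, 512, device="cuda")
A = A @ A.T  # symmetrize so eigh's assumptions hold
eigenvalues, eigenvectors = torch.linalg.eigh(A)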

We have now observed this problem arise in our experiments for AlgoPerf for the OGBG model:

W0226 19:28:24.702791 140677981660992 shampoo_preconditioner_list.py:629] Matrix inverse root computation failed for factor matrix 52.block_0.1 with exception CUDA error: an illegal memory access was encountered                                                                                                                                                                               
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                                          
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                                           
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                              
. Using previous inv_factor_matrix and continuing...                                                                                                                                             
    timing, metrics = train_once(workload, workload_name,                                                                                                                                        
  File "submission_runner.py", line 336, in train_once                                                                                                                                           
    optimizer_state, model_params, model_state = update_params(                                                                                                                                  
  File "/algorithmic-efficiency/submissions/shampoo_submission/pytorch_shampoo.py", line 178, in update_params                                                                                   
    optimizer_state['optimizer'].step()                                                                                                                                                          
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper                                                                                                 
    return wrapped(*args, **kwargs)                                                                                                                                                              
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 373, in wrapper                                                                                                   
    out = func(*args, **kwargs)                                                                                                                                                                  
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context                                                                                        
    return func(*args, **kwargs)                                                                                                                                                                 
  File "/algorithmic-efficiency/submissions/shampoo_submission/optimizers/distributed_shampoo/distributed_shampoo.py", line 905, in step                                                         
    self._per_group_step(                                                                                                                                                                        
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context                                                                                        
    return func(*args, **kwargs)                                                                                                                                                                 
  File "/algorithmic-efficiency/submissions/shampoo_submission/optimizers/distributed_shampoo/distributed_shampoo.py", line 753, in _per_group_step_impl                                         
W0226 19:28:24.704183 140547811231552 matrix_functions.py:218] Failed to compute eigendecomposition in torch.float32 precision with exception cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnXsyevd( handle, params, jobz, uplo, n, CUDA_R_32F, reinterpret_cast<void*>(A), lda, CUDA_R_32F, reinterpret_cast<void*>(W), CUDA_R_32F, reinterpret_cast<void*>(bufferOnDevice), workspaceInBytesOnDevice, reinterpret_cast<void*>(bufferOnHost), workspaceInBytesOnHost, info)`. This error may appear if the input matrix contains NaN. If you keep seeing this error, you may use `torch.backends.cuda.preferred_linalg_library()` to try linear algebra operators with other supported backends. See https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.preferred_linalg_library! Retrying in double precision...
    torch._foreach_mul_(state_lists[MASKED_FILTERED_GRAD_LIST], beta1)                                                                                                                           
RuntimeError: CUDA error: an illegal memory access was encountered                                                                                                                               
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                                          
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                                           
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                              

Notice that in this case we attempt to bypass the error (it is caught by our script), but subsequent CUDA kernels then also hit illegal memory accesses.
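
For context, the retry logic referenced in the matrix_functions.py log line above roughly follows the shape below; this is a simplified sketch with a hypothetical helper name, not the actual implementation. It tries the eigendecomposition in float32 and falls back to float64 on failure, but once an illegal memory access has corrupted the CUDA context, the retry (and any subsequent kernel) typically fails as well, which is what we observe:

import torch

def eigh_with_retry(factor_matrix: torch.Tensor):
    # Attempt the eigendecomposition in float32 first.
    try:
        return torch.linalg.eigh(factor_matrix)
    except RuntimeError as exc:
        # Retry in double precision, mirroring the pattern in the log above.
        print(f"Failed to compute eigendecomposition in torch.float32: {exc}")
        return torch.linalg.eigh(factor_matrix.double())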

Consistent with the GitHub issues posted above, we have checked that the cuSOLVER version is /usr/local/cuda/lib64/libcusolver.so.11.4.1.48, which is the problematic version.

Steps to Reproduce

Follow the steps in pytorch/pytorch#105359 (comment).

Source or Possible Fix

If possible, we would suggest using CUDA 12.1.1 instead of CUDA 11.8 for the contest in order to avoid these instabilities.

Is there a way that we can change the Dockerfile to use PyTorch 2.1.0 with CUDA 12.1.1? (We have tried changing the first line in the Dockerfile to docker.io/nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04, but we still observe PyTorch 2.1.0 + CUDA 11.8 being used when checking torch.__version__ and torch.version.cuda.)
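
For completeness, the check we ran inside the container is just the standard version attributes:

import torch

# These report the CUDA runtime bundled with the installed PyTorch wheel,
# not the CUDA toolkit installed in the container image.
print(torch.__version__)   # e.g. "2.1.0+cu118"
print(torch.version.cuda)  # e.g. "11.8"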

cc @anana10c @mikerabbat @tsunghsienlee @yuchenhao @shintaro-iwasaki

@priyakasimbeg
Contributor

priyakasimbeg commented Feb 27, 2024

Hi! I think we can explore upgrading the PyTorch CUDA versions, given that this is blocking your submission.
Note that the PyTorch packages ship with their own CUDA runtimes; in this case, 2.1.0+cu118 uses CUDA 11.8 regardless of what CUDA version is installed in the local environment.

So it is expected that changing the CUDA version in the Docker container won't change anything. We've only pinned the local CUDA to that version for consistency between the CUDA versions JAX and PyTorch use.

It seems like it is possible to use local CUDA with PyTorch; see this discussion. Could you check whether the proposed solution in that discussion works? If we can install it in the Docker images so that it generalizes, we can probably upgrade.

@msaroufim do you have any tips on installing PyTorch 2.1.0 with CUDA 12.1?
Alternatively, @mikerabbat @msaroufim, since this issue seems to be contained within the PyTorch installation, is it possible to get someone from PyTorch to look at this bug or to release PyTorch with CUDA 12.1?

@msaroufim
Member

msaroufim commented Feb 27, 2024

Thankfully this has a one-line fix: you need to pass in the index URL, and you can Ctrl+F the index URL to see which combinations are available.

(test) ubuntu@ip-172-31-41-234:~$ pip3 install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu121
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch==2.1.0
  Downloading https://download.pytorch.org/whl/cu121/torch-2.1.0%2Bcu121-cp310-cp310-linux_x86_64.whl (2200.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━ 1.8/2.2 GB 142.4 MB/s eta 0:00:04

This usually works fine for most non-ancient versions of PyTorch, but for older ones you can install the cudatoolkit using conda and then build PyTorch from source, which will pick up whatever cudatoolkit version you have installed. This can be slow on machines with a small number of CPU cores, though.

@priyakasimbeg
Contributor

priyakasimbeg commented Feb 27, 2024

Awesome, thank you @msaroufim!

@hjmshi can you confirm whether upgrading with the above procedure resolves the issue with linalg.eigh?

@hjmshi
Author

hjmshi commented Feb 27, 2024

Thanks @priyakasimbeg and @msaroufim! This change works on our side. We essentially modify the Dockerfile to the following:

# To build Docker image
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04

...

RUN if [ "$framework" = "jax" ] ; then \
        echo "Installing Jax GPU" \
        && cd /algorithmic-efficiency \
        && pip install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html' \
        && pip install -e '.[pytorch_cpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'; \
    elif [ "$framework" = "pytorch" ] ; then \
        echo "Installing Pytorch GPU" \
        && cd /algorithmic-efficiency \
        && pip install -e '.[jax_cpu]' \
        && pip3 install torch==2.1.0 -f 'https://download.pytorch.org/whl/cu121'; \
    elif [ "$framework" = "both" ] ; then \
        echo "Installing Jax GPU and Pytorch GPU" \
        && cd /algorithmic-efficiency \
        && pip install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html' \
        && pip3 install torch==2.1.0 -f 'https://download.pytorch.org/whl/cu121'; \
    else \
        echo "Invalid build-arg $framework: framework should be either jax, pytorch or both." >&2 \
        && exit 1 ; \
    fi

...

Just confirming that this is the change you both had in mind? 😄

@priyakasimbeg
Contributor

priyakasimbeg commented Feb 27, 2024

Yes, almost.
Ideally we'd want to change the version 2.1.0+cu118 to 2.1.0 in the setup.cfg, so that the only changes to the Dockerfile are the index URL and the base image.
But I can do that on my end.

@hjmshi
Author

hjmshi commented Feb 27, 2024

Got it, makes sense! Thanks @priyakasimbeg!

@priyakasimbeg
Contributor

Just updating this thread. We're currently testing the workloads with the new CUDA version and JAX and PyTorch installations.

@priyakasimbeg
Contributor

Merged CUDA upgrade into dev #659
