CUDA 11.8's syevd solver can cause an illegal memory access error when called through torch.linalg.eigh #655


Closed
hjmshi opened this issue Feb 27, 2024 · 8 comments

Comments

@hjmshi

hjmshi commented Feb 27, 2024

We are preparing a PyTorch submission for AlgoPerf that relies on the torch.linalg.eigh operator, which calls the linalg_eigh_cusolver_syevd solver from cuSOLVER. While running this operator with PyTorch 2.1.0 + CUDA 11.8, we have observed that it can create an illegal memory access error in our AlgoPerf runs. This failure is not recoverable.

Description

We have observed previous issues with CUDA 11.8 where the torch.linalg.eigh operator can raise a CUDA illegal memory access error, which is unrecoverable; see, for example, pytorch/pytorch#105359 and pytorch/pytorch#94772 (comment).
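
For reference, PyTorch's own error message (visible in the log below) suggests torch.backends.cuda.preferred_linalg_library() as a way to route around the cuSOLVER syevd path. A minimal sketch of that workaround follows; the matrix size and contents are made up for illustration, and this sidesteps the failing code path rather than fixing the underlying CUDA 11.8 bug:

import torch

# Prefer MAGMA over cuSOLVER for linear algebra ops, as the error message
# suggests. Illustrative only; we have not adopted this for our submission.
torch.backends.cuda.preferred_linalg_library("magma")

# The failing call in our runs boils down to an eigendecomposition like this:
A = torch.randn(512, 512, device="cuda")
A = A @ A.T  # symmetrize so eigh's assumptions hold
eigenvalues, eigenvectors = torch.linalg.eigh(A)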

We have now observed this problem arise in our experiments for AlgoPerf for the OGBG model:

W0226 19:28:24.702791 140677981660992 shampoo_preconditioner_list.py:629] Matrix inverse root computation failed for factor matrix 52.block_0.1 with exception CUDA error: an illegal memory access was encountered                                                                                                                                                                               
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                                          
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                                           
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                              
. Using previous inv_factor_matrix and continuing...                                                                                                                                             
    timing, metrics = train_once(workload, workload_name,                                                                                                                                        
  File "submission_runner.py", line 336, in train_once                                                                                                                                           
    optimizer_state, model_params, model_state = update_params(                                                                                                                                  
  File "/algorithmic-efficiency/submissions/shampoo_submission/pytorch_shampoo.py", line 178, in update_params                                                                                   
    optimizer_state['optimizer'].step()                                                                                                                                                          
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper                                                                                                 
    return wrapped(*args, **kwargs)                                                                                                                                                              
  File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 373, in wrapper                                                                                                   
    out = func(*args, **kwargs)                                                                                                                                                                  
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context                                                                                        
    return func(*args, **kwargs)                                                                                                                                                                 
  File "/algorithmic-efficiency/submissions/shampoo_submission/optimizers/distributed_shampoo/distributed_shampoo.py", line 905, in step                                                         
    self._per_group_step(                                                                                                                                                                        
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context                                                                                        
    return func(*args, **kwargs)                                                                                                                                                                 
  File "/algorithmic-efficiency/submissions/shampoo_submission/optimizers/distributed_shampoo/distributed_shampoo.py", line 753, in _per_group_step_impl                                         
W0226 19:28:24.704183 140547811231552 matrix_functions.py:218] Failed to compute eigendecomposition in torch.float32 precision with exception cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnXsyevd( handle, params, jobz, uplo, n, CUDA_R_32F, reinterpret_cast<void*>(A), lda, CUDA_R_32F, reinterpret_cast<void*>(W), CUDA_R_32F, reinterpret_cast<void*>(bufferOnDevice), workspaceInBytesOnDevice, reinterpret_cast<void*>(bufferOnHost), workspaceInBytesOnHost, info)`. This error may appear if the input matrix contains NaN. If you keep seeing this error, you may use `torch.backends.cuda.preferred_linalg_library()` to try linear algebra operators with other supported backends. See https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.preferred_linalg_library! Retrying in double precision...
    torch._foreach_mul_(state_lists[MASKED_FILTERED_GRAD_LIST], beta1)                                                                                                                           
RuntimeError: CUDA error: an illegal memory access was encountered                                                                                                                               
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.                                                                          
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.                                                                                                                                           
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.                                                                                                                              

Notice that in this case we attempt to bypass the error (it is caught by our script), but subsequent CUDA kernels then also hit illegal memory accesses.
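
For context, the retry logic referenced in the matrix_functions.py log line above roughly follows the shape below; this is a simplified sketch with a hypothetical helper name, not the actual implementation. It tries the eigendecomposition in float32 and falls back to float64 on failure, but once an illegal memory access has corrupted the CUDA context, the retry (and any subsequent kernel) typically fails as well, which is what we observe:

import torch

def eigh_with_retry(factor_matrix: torch.Tensor):
    # Attempt the eigendecomposition in float32 first.
    try:
        return torch.linalg.eigh(factor_matrix)
    except RuntimeError as exc:
        # Retry in double precision, mirroring the pattern in the log above.
        print(f"Failed to compute eigendecomposition in torch.float32: {exc}")
        return torch.linalg.eigh(factor_matrix.double())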

Consistent with the GitHub issues posted above, we have checked that the cuSOLVER version is /usr/local/cuda/lib64/libcusolver.so.11.4.1.48, which is the problematic version.

Steps to Reproduce

Follow the steps in pytorch/pytorch#105359 (comment).

Source or Possible Fix

If possible, we would suggest using CUDA 12.1.1 instead of CUDA 11.8 for the contest in order to avoid these instabilities.

Is there a way that we can change the Dockerfile to use PyTorch 2.1.0 with CUDA 12.1.1? (We have tried changing the first line in the Dockerfile to docker.io/nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04, but we still observe PyTorch 2.1.0 + CUDA 11.8 being used when checking torch.__version__ and torch.version.cuda.)
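
For completeness, the check we ran inside the container is just the standard version attributes:

import torch

# These report the CUDA runtime bundled with the installed PyTorch wheel,
# not the CUDA toolkit installed in the container image.
print(torch.__version__)   # e.g. "2.1.0+cu118"
print(torch.version.cuda)  # e.g. "11.8"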

cc @anana10c @mikerabbat @tsunghsienlee @yuchenhao @shintaro-iwasaki

@priyakasimbeg
Contributor

priyakasimbeg commented Feb 27, 2024

Hi! I think we can explore upgrading the PyTorch CUDA versions, given that this is blocking your submission.
Note that the PyTorch packages ship with their own CUDA runtimes; in this case, 2.1.0+cu118 uses CUDA 11.8 regardless of what CUDA version is installed in the local environment.

So it is expected that changing the CUDA version in the Docker container won't change anything. We've only pinned the local CUDA to that version for consistency between the CUDA versions JAX and PyTorch use.

It seems like it is possible to use local CUDA with PyTorch; see this discussion. Could you check whether the proposed solution in that discussion works? If we can install it in the Docker images so that it generalizes, we can probably upgrade.

@msaroufim do you have any tips on installing PyTorch 2.1.0 with CUDA 12.1?
Alternatively, @mikerabbat @msaroufim, since this issue seems to be contained within the PyTorch installation, is it possible to get someone from PyTorch to look at this bug or to release PyTorch with CUDA 12.1?

@msaroufim
Member

msaroufim commented Feb 27, 2024

Thankfully this has a one-line fix: you need to pass in the index URL, and you can Ctrl+F the index URL to see which combinations are available.

(test) ubuntu@ip-172-31-41-234:~$ pip3 install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu121
Looking in indexes: https://download.pytorch.org/whl/cu121
Collecting torch==2.1.0
  Downloading https://download.pytorch.org/whl/cu121/torch-2.1.0%2Bcu121-cp310-cp310-linux_x86_64.whl (2200.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━ 1.8/2.2 GB 142.4 MB/s eta 0:00:04

This usually works fine for most non-ancient versions of PyTorch, but for older ones you can install the cudatoolkit using conda and then build PyTorch from source, which will pick up whatever cudatoolkit version you have installed. This can be slow on machines with a small number of CPU cores, though.

@priyakasimbeg
Contributor

priyakasimbeg commented Feb 27, 2024

Awesome, thank you @msaroufim!

@hjmshi can you confirm whether upgrading with the above procedure resolves the issue with linalg.eigh?

@hjmshi
Author

hjmshi commented Feb 27, 2024

Thanks @priyakasimbeg and @msaroufim! This change works on our side. We essentially modify the Dockerfile to the following:

# To build Docker image
FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04

...

RUN if [ "$framework" = "jax" ] ; then \
        echo "Installing Jax GPU" \
        && cd /algorithmic-efficiency \
        && pip install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html' \
        && pip install -e '.[pytorch_cpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'; \
    elif [ "$framework" = "pytorch" ] ; then \
        echo "Installing Pytorch GPU" \
        && cd /algorithmic-efficiency \
        && pip install -e '.[jax_cpu]' \
        && pip3 install torch==2.1.0 -f 'https://download.pytorch.org/whl/cu121'; \
    elif [ "$framework" = "both" ] ; then \
        echo "Installing Jax GPU and Pytorch GPU" \
        && cd /algorithmic-efficiency \
        && pip install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html' \
        && pip3 install torch==2.1.0 -f 'https://download.pytorch.org/whl/cu121'; \
    else \
        echo "Invalid build-arg $framework: framework should be either jax, pytorch or both." >&2 \
        && exit 1 ; \
    fi

...

Just confirming that this is the change you both had in mind? 😄

@priyakasimbeg
Contributor

priyakasimbeg commented Feb 27, 2024

Yes, almost.
Ideally we'd want to change the version 2.1.0+cu118 to 2.1.0 in the setup.cfg, so that the only changes to the Dockerfile are the index URL and the base image.
But I can do that on my end.

@hjmshi
Author

hjmshi commented Feb 27, 2024

Got it, makes sense! Thanks @priyakasimbeg!

@priyakasimbeg
Contributor

Just updating this thread. We're currently testing the workloads with the new CUDA version and JAX and PyTorch installations.

@priyakasimbeg
Contributor

Merged CUDA upgrade into dev #659
