Device not found error starting on 8/31 PyTorch nightlies #1472

Closed
ebsmothers opened this issue Sep 2, 2024 · 2 comments

@ebsmothers
Contributor

ebsmothers commented Sep 2, 2024

Filing this here because I cannot reproduce the issue in a PyTorch-only env, but I see it consistently when installing torchtune. The following works fine:

conda create -n pt-nightly-08-30 python=3.11
conda activate pt-nightly-08-30
# Install PyTorch nightly from 8/30
pip install --pre torch==2.5.0.dev20240830+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
# Normal torchtune install
pip install -e ".[dev]"
# Reinstall torchao due to incompatibility with nightly PyTorch (exact nightly version doesn't matter too much)
pip install --force-reinstall --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121

# Download model and run any recipe
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf 
tune run lora_finetune_single_device --config llama2/7B_qlora_single_device
...
1|1|Loss: 1.6810555458068848:   0%|                                                                                                                           | 1/1617 [00:15<7:04:51, 15.77s/it]

If we install the 8/31 nightly instead:

conda create -n pt-nightly-08-31 python=3.11
conda activate pt-nightly-08-31
# Install PyTorch nightly from 8/31
pip install --pre torch==2.5.0.dev20240831+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
# Normal torchtune install
pip install -e ".[dev]"
# Reinstall torchao due to incompatibility with nightly PyTorch (exact nightly version doesn't matter too much)
pip install --force-reinstall --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121

# Download model and run any recipe
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf 
tune run lora_finetune_single_device --config llama2/7B_qlora_single_device
...
  File "/data/users/ebs/ebs-torchtune/torchtune/utils/_device.py", line 96, in _validate_device_from_env
    raise RuntimeError(
RuntimeError: The device cuda:0 is not available on this machine.

The offending line seems to be torch.empty(0, device=torch.device('cuda:0')). If I run this in a Python interpreter, things get even more interesting:

python3
>>> import torch
>>> torch.empty(0, device=torch.device('cuda:0'))
tensor([], device='cuda:0')

python3
>>> import torch
>>> import torchtune
>>> torch.empty(0, device=torch.device('cuda:0'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Also, I've run the same interpreter commands against torchtune commits as far back as 6f37d15, so I don't think a recent change on our end caused this.
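The RuntimeError from _validate_device_from_env and the CUDA error from the interpreter session appear to be the same failure, but the first message hides the second. Below is a minimal sketch of a device check that keeps the underlying error attached. It is not the actual code in torchtune/utils/_device.py; the name `validate_device` and the injectable `allocate` parameter are hypothetical and exist only so the check can be exercised without a GPU.

```python
def validate_device(device, allocate=None):
    """Raise RuntimeError if a zero-element tensor cannot be created on `device`.

    `allocate` is a hypothetical hook for testing; by default the check uses
    the same torch.empty call quoted in the report.
    """
    if allocate is None:
        import torch  # imported lazily so the sketch loads without PyTorch

        def allocate(dev):
            return torch.empty(0, device=torch.device(dev))

    try:
        allocate(device)
    except RuntimeError as exc:
        # "from exc" chains the original error (e.g. "CUDA error: operation
        # not supported") so it still shows up in the traceback.
        raise RuntimeError(
            f"The device {device} is not available on this machine."
        ) from exc
```

With chaining, the traceback would show the CUDA error first and the "not available" message second, which points at a runtime problem rather than a missing GPU.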

@ebsmothers
Contributor Author

Update: opened pytorch/ao#795 after pinpointing that the error occurs whenever we import torchao's NF4Tensor, even without any torchtune imports.
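One way to narrow this kind of failure down to a single import is to run the allocation in a fresh interpreter per candidate import, since the result depends on what was imported earlier in the process. The script below is a rough sketch of that approach, not the exact procedure used here. The helper names are made up for illustration, and the NF4Tensor import path is an assumption that may differ between torchao versions.

```python
import subprocess
import sys

# The allocation that fails in the report.
ALLOC = "torch.empty(0, device=torch.device('cuda:0'))"
BASE = ["import torch"]


def build_probe(imports, statement=ALLOC):
    """Return Python source that runs `imports` and then `statement`."""
    return "\n".join(list(imports) + [statement])


def first_error_line(stderr):
    """Return the first unindented stderr line that is not the traceback header."""
    for line in stderr.splitlines():
        if line and not line[0].isspace() and not line.startswith("Traceback"):
            return line
    return ""


def run_probe(imports, statement=ALLOC):
    """Run the probe in a fresh interpreter; return (exit code, error line)."""
    proc = subprocess.run(
        [sys.executable, "-c", build_probe(imports, statement)],
        capture_output=True,
        text=True,
    )
    return proc.returncode, first_error_line(proc.stderr)


def main():
    candidates = {
        "torch only": BASE,
        "torchtune": BASE + ["import torchtune"],
        # Assumed import path; adjust to the installed torchao.
        "torchao NF4Tensor": BASE + ["from torchao.dtypes.nf4tensor import NF4Tensor"],
    }
    for label, imports in candidates.items():
        code, err = run_probe(imports)
        print(f"{label}: {'ok' if code == 0 else 'FAILED'} {err}")


if __name__ == "__main__":
    main()
```

Each candidate gets its own process, so a failure in one probe cannot leak CUDA state into the next. Warnings printed to stderr during import may show up as the reported error line, so treat that column as a hint rather than a diagnosis.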

@ebsmothers
Contributor Author

This is resolved now (see the linked ao issue).
