Device not found error starting on 8/31 PyTorch nightlies #1472

ebsmothers · 2024-09-02T18:41:36Z

Filing this here because I cannot reproduce the issue in a PyTorch-only env, but I see it consistently when installing torchtune. The following works fine:

conda create -n pt-nightly-08-30 python=3.11
conda activate pt-nightly-08-30
# Install PyTorch nightly from 8/30
pip install --pre torch==2.5.0.dev20240830+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
# Normal torchtune install
pip install -e ".[dev]"
# Reinstall torchao due to incompatibility with nightly PyTorch (exact nightly version doesn't matter too much)
pip install --force-reinstall --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121

# Download model and run any recipe
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf 
tune run lora_finetune_single_device --config llama2/7B_qlora_single_device
...
1|1|Loss: 1.6810555458068848:   0%|                                                                                                                           | 1/1617 [00:15<7:04:51, 15.77s/it]

If we install the 8/31 nightly instead:

conda create -n pt-nightly-08-31 python=3.11
conda activate pt-nightly-08-31
# Install PyTorch nightly from 8/31
pip install --pre torch==2.5.0.dev20240831+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
# Normal torchtune install
pip install -e ".[dev]"
# Reinstall torchao due to incompatibility with nightly PyTorch (exact nightly version doesn't matter too much)
pip install --force-reinstall --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121

# Download model and run any recipe
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf 
tune run lora_finetune_single_device --config llama2/7B_qlora_single_device
...
  File "/data/users/ebs/ebs-torchtune/torchtune/utils/_device.py", line 96, in _validate_device_from_env
    raise RuntimeError(
RuntimeError: The device cuda:0 is not available on this machine.

The offending line seems to be torch.empty(0, device=torch.device('cuda:0')). If I run this in a Python interpreter, things get even more interesting: the call succeeds on its own but fails once torchtune has been imported.

python3
>>> import torch
>>> torch.empty(0, device=torch.device('cuda:0'))
tensor([], device='cuda:0')

python3
>>> import torch
>>> import torchtune
>>> torch.empty(0, device=torch.device('cuda:0'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I've also run the same ipython commands on commits as far back as 6f37d15, so I don't think a recent change on our end caused this.


ebsmothers · 2024-09-02T19:11:46Z

Update: I opened pytorch/ao#795 after pinpointing that the error occurs whenever we import ao's NF4Tensor, even without any torchtune imports.
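
For anyone who wants to narrow down a similar import-triggered failure, here is a minimal sketch of one way to do it. This is not necessarily how the problem was actually pinpointed. Each candidate import runs in a fresh interpreter, followed by a probe statement, so that state left behind by one import cannot affect the next. The helper names probe_after_import and first_breaking_import are hypothetical and exist only for this illustration.

```python
import subprocess
import sys
from typing import Optional, Sequence


def probe_after_import(import_stmt: str, probe: str) -> tuple[bool, str]:
    """Run import_stmt, then probe, in a fresh Python interpreter.

    Returns (succeeded, last line of stderr). When the child process fails,
    the last stderr line is normally the exception message.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"{import_stmt}\n{probe}\n"],
        capture_output=True,
        text=True,
    )
    stderr_lines = result.stderr.strip().splitlines()
    return result.returncode == 0, (stderr_lines[-1] if stderr_lines else "")


def first_breaking_import(candidates: Sequence[str], probe: str) -> Optional[str]:
    """Return the first import statement after which the probe fails, else None."""
    for import_stmt in candidates:
        ok, _ = probe_after_import(import_stmt, probe)
        if not ok:
            return import_stmt
    return None
```

For this issue, the probe would be "import torch; torch.empty(0, device=torch.device('cuda:0'))", and the candidates could be "import torchtune" and "from torchao.dtypes.nf4tensor import NF4Tensor". The NF4Tensor module path is an assumption and may differ between torchao versions. Note that a candidate whose import itself raises is also reported as breaking, so check the stderr line returned by probe_after_import before drawing conclusions.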

ebsmothers · 2024-09-10T23:16:19Z

This is resolved now (see the linked ao issue)

ebsmothers closed this as completed Sep 10, 2024