Filing this here because I cannot reproduce the issue in a PyTorch-only env, but I see it consistently when installing torchtune. The following works fine:
conda create -n pt-nightly-08-30 python=3.11
conda activate pt-nightly-08-30
# Install PyTorch nightly from 8/30
pip install --pre torch==2.5.0.dev20240830+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
# Normal torchtune install
pip install -e ".[dev]"
# Reinstall torchao due to incompatibility with nightly PyTorch (exact nightly version doesn't matter too much)
pip install --force-reinstall --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121
# Download model and run any recipe
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf
tune run lora_finetune_single_device --config llama2/7B_qlora_single_device
...
1|1|Loss: 1.6810555458068848: 0%| | 1/1617 [00:15<7:04:51, 15.77s/it]
If we install the 8/31 nightly instead:
conda create -n pt-nightly-08-31 python=3.11
conda activate pt-nightly-08-31
# Install PyTorch nightly from 8/31
pip install --pre torch==2.5.0.dev20240831+cu121 --index-url https://download.pytorch.org/whl/nightly/cu121
# Normal torchtune install
pip install -e ".[dev]"
# Reinstall torchao due to incompatibility with nightly PyTorch (exact nightly version doesn't matter too much)
pip install --force-reinstall --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121
# Download model and run any recipe
tune download meta-llama/Llama-2-7b-hf --output-dir /tmp/Llama-2-7b-hf
tune run lora_finetune_single_device --config llama2/7B_qlora_single_device
...
File "/data/users/ebs/ebs-torchtune/torchtune/utils/_device.py", line 96, in _validate_device_from_env
raise RuntimeError(
RuntimeError: The device cuda:0 is not available on this machine.
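Since the only intended difference between the two environments is the PyTorch nightly, it may help to record exactly what ended up installed in each one after the torchao reinstall. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is made up for illustration and is not part of torchtune.

```python
from importlib import metadata

# Packages named in the repro steps above.
PACKAGES = ("torch", "torchao", "torchtune")


def installed_versions(packages=PACKAGES):
    """Return {package name: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running this in both conda envs and diffing the output should show whether anything besides the torch nightly differs between them.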
The offending line seems to be torch.empty(0, device=torch.device('cuda:0')). If I run this in a Python interpreter after importing torchtune, things get even more interesting: the call fails with a different CUDA error.
python3
>>> import torch
>>> import torchtune
>>> torch.empty(0, device=torch.device('cuda:0'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: operation not supported
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Also I've run the same ipython commands as far back as 6f37d15, so I don't think there are any recent breakages on our end that would've caused this.
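To check whether importing torchtune is what changes the behavior, the same allocation can be run in two fresh interpreters, one importing only torch and one importing torch and torchtune. This is a rough sketch, not a confirmed diagnosis; `probe` and `compare` are hypothetical helper names, and each case runs in its own subprocess so that CUDA state from one cannot leak into the other.

```python
import subprocess
import sys

# Statement run in each child interpreter after the imports for that case.
PROBE = "torch.empty(0, device=torch.device('cuda:0')); print('ok')"


def probe(imports, python=sys.executable):
    """Run the allocation in a fresh interpreter.

    Returns (succeeded, last line of stderr), where the last stderr line is
    usually the exception message when the child fails.
    """
    code = "; ".join(f"import {name}" for name in imports) + "; " + PROBE
    result = subprocess.run([python, "-c", code], capture_output=True, text=True)
    stderr_lines = result.stderr.strip().splitlines()
    return result.returncode == 0, (stderr_lines[-1] if stderr_lines else "")


def compare():
    """Run the probe with and without the torchtune import."""
    return {
        "torch only": probe(["torch"]),
        "torch + torchtune": probe(["torch", "torchtune"]),
    }


if __name__ == "__main__":
    for label, (ok, err) in compare().items():
        print(f"{label}: {'ok' if ok else 'FAILED'} {err}")
```

If only the second case fails in the 8/31 env, that would point at something triggered by the torchtune import rather than at the CUDA setup itself.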