
Nvidia drivers 545.29.02 broken --tensor-parallel-size #1801


Closed
Tostino opened this issue Nov 27, 2023 · 16 comments

Comments

@Tostino
Contributor

Tostino commented Nov 27, 2023

I just upgraded my drivers to 545.29.02 and it has broken the ability to run models larger than a single GPU's RAM with vLLM.

If I pass --tensor-parallel-size 2, things just hang when trying to create the engine. Without it, the model loads just fine (if it fits in a single GPU's RAM).

(venv) user@pop-os:/media/user/Data/IdeaProjects/vllm$ python3 -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B --tensor-parallel-size 2
INFO 11-27 12:46:10 api_server.py:648] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='teknium/OpenHermes-2.5-Mistral-7B', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2023-11-27 12:46:36,779 INFO worker.py:1673 -- Started a local Ray instance.
INFO 11-27 12:46:37 llm_engine.py:72] Initializing an LLM engine with config: model='teknium/OpenHermes-2.5-Mistral-7B', tokenizer='teknium/OpenHermes-2.5-Mistral-7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Pytorch version: '2.1.1+cu121'

(venv) user@pop-os:/media/user/Data/IdeaProjects/vllm$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

The model never finishes loading. nvidia-smi shows some load on the GPUs, and I have two CPU cores pegged as well.
[screenshots: nvidia-smi output and CPU load]

@simon-mo
Collaborator

Can we start by debugging torch.distributed, which is the underlying implementation we use? Try running this example code:
https://pytorch.org/tutorials/intermediate/dist_tuto.html#setup
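
For reference, here is a minimal sketch along those lines (not vLLM's code; the 2-GPU setup and the single all-reduce are assumptions based on the report above). It spawns one process per GPU and runs one NCCL all-reduce:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Fixed rendezvous address; both workers run on the same machine.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # One-element all-reduce across the GPUs.
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: all_reduce result = {t.item()}")  # expect 2.0 with 2 GPUs
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # assumes two visible GPUs
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)

If this prints promptly, torch.distributed and NCCL are working; if it hangs, the problem is below vLLM.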

@viktor-ferenczi
Contributor

vLLM 0.2.1 worked before with CUDA 11.8. I installed the latest CUDA 12.3 today to run a 0.2.2+ build from main. The same problem (hang on model load) happens with cuda-drivers-545 on my dual 4090 system, even for models which worked before. The issue happens only with --tensor-parallel-size=2; it does not happen on a single GPU.

I tested the Torch tutorial example; it completes in a few seconds without printing anything, so I guess that's good. No hang occurs.

@simon-mo
Collaborator

To further debug the hanging, it would be great to use py-spy to check the stacks of the processes and see what they are hanging on.

I think if Ray is used, running ray stack should have the same effect.
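
For example (the PID is a placeholder for one of the hanging RayWorker processes; py-spy may need sudo depending on ptrace settings):

pip install py-spy
py-spy dump --pid <RayWorker PID>

# or, since vLLM launches the workers through Ray:
ray stack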

@Tostino
Contributor Author

Tostino commented Nov 29, 2023

@simon-mo Here is the output of ray stack while things are spinning: ray_stack.txt

@mirkogolze

I think I have the same issue.

I tried to start it with Docker:

docker run --runtime nvidia --gpus all \
        -v /var/gpu-volumes/llmhub:/root/.cache/huggingface \
        -p 8000:8000 \
        --shm-size=10.24gb \
        --env "HUGGING_FACE_HUB_TOKEN=XXXXXXXXXXXXXX" \
        vllm/vllm-openai:latest \
        --model mistralai/Mistral-7B-v0.1 \
        --worker-use-ray \
        --tensor-parallel-size 2

The two RayWorker processes are running at 100% CPU:
[screenshot of process list]

The process gets killed after about 45 minutes. The console shows the following:

(RayWorker pid=4751) [E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800108 milliseconds before timing out.
(RayWorker pid=4751) [E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorker pid=4751) [E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
(RayWorker pid=4751) [E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800108 milliseconds before timing out.
(RayWorker pid=4751) [2023-11-29 14:22:53,323 E 4751 4889] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800108 milliseconds before timing out.
(RayWorker pid=4751) [2023-11-29 14:22:53,334 E 4751 4889] logging.cc:104: Stack trace:
(RayWorker pid=4751)  /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xf199fa) [0x7f59c79c79fa] ray::operator<<()
(RayWorker pid=4751) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xf1c1b8) [0x7f59c79ca1b8] ray::TerminateHandler()
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f59c693020c]
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f59c6930277]
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f59c69301fe]
(RayWorker pid=4751) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xc86dc5) [0x7f5675cbadc5] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f59c695e253]
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f59c868eac3]
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f59c871fbf4] __clone
(RayWorker pid=4751)
(RayWorker pid=4751) *** SIGABRT received at time=1701267773 on cpu 77 ***
(RayWorker pid=4751) PC: @     0x7f59c86909fc  (unknown)  pthread_kill
(RayWorker pid=4751)     @     0x7f59c863c520  (unknown)  (unknown)
(RayWorker pid=4751) [2023-11-29 14:22:53,334 E 4751 4889] logging.cc:361: *** SIGABRT received at time=1701267773 on cpu 77 ***
(RayWorker pid=4751) [2023-11-29 14:22:53,334 E 4751 4889] logging.cc:361: PC: @     0x7f59c86909fc  (unknown)  pthread_kill
(RayWorker pid=4751) [2023-11-29 14:22:53,334 E 4751 4889] logging.cc:361:     @     0x7f59c863c520  (unknown)  (unknown)
(RayWorker pid=4751) Fatal Python error: Aborted

The whole console output: consoleOutput.txt

@viktor-ferenczi
Contributor

Downgrade to driver version 535; that works.

If you installed CUDA from the network repository on Ubuntu 22.04, then this should work:

apt-get install nvidia-dkms-535 nvidia-utils-535 nvidia-driver-535 cuda-drivers-535

sudo apt-mark hold nvidia-dkms-535
sudo apt-mark hold nvidia-utils-535
sudo apt-mark hold nvidia-driver-535
sudo apt-mark hold cuda-drivers-535

Then reboot.

@viktor-ferenczi
Contributor

viktor-ferenczi commented Nov 29, 2023

Related issue at llama.cpp; they had the same problem, which caused broken model output (lots of hash marks): ggml-org/llama.cpp#3772

@simon-mo
Collaborator

My hunch would be some sort of weird nccl + pytorch + cuda combination causing deadlocks. (cf NVIDIA/nccl#1013 (comment))

@simon-mo
Collaborator

@Tostino's stack trace shows the model workers stuck on a kernel launch:

Stack dump for user      199183 99.1  0.9 41805784 936008 pts/8 Rl+  08:59   0:37 ray::RayWorker.execute_method
Process 199183: ray::RayWorker.execute_method
Python v3.10.12 (/usr/bin/python3.10)

Thread 199183 (active): "MainThread"
    0x7f1d9bb0324f (libcuda.so.545.29.02)
    0x7f1d9b77cb30 (libcuda.so.545.29.02)
    0x7f1d9bb01f3a (libcuda.so.545.29.02)
    0x7f1d9b88cc86 (libcuda.so.545.29.02)
    0x7f1d9b8748de (libcuda.so.545.29.02)
    0x7f1d9b877260 (libcuda.so.545.29.02)
    0x7f1d9b8d8e44 (libcuda.so.545.29.02)
    0x7f2418437c5d (nvidia/cuda_runtime/lib/libcudart.so.12)
    0x7f24184383a0 (nvidia/cuda_runtime/lib/libcudart.so.12)
    0x7f24184383ff (nvidia/cuda_runtime/lib/libcudart.so.12)
    0x7f241843af84 (nvidia/cuda_runtime/lib/libcudart.so.12)
    0x7f2418414930 (nvidia/cuda_runtime/lib/libcudart.so.12)
    cudaLaunchKernel (nvidia/cuda_runtime/lib/libcudart.so.12)
    (anonymous namespace)::gpu_kernel_with_index<__nv_hdl_wrapper_t<false, false, false, __nv_dl_tag<at::Tensor& (*)(c10::Scalar const&, c10::Scalar const&, c10::Scalar const&, at::Tensor&), &at::native::arange_cuda_out(c10::Scalar const&, c10::Scalar const&, c10::Scalar const&, at::Tensor&), (unsigned int)3>, int (long), long, long> > (libtorch_cuda.so)
    at::native::arange_cuda_out(c10::Scalar const&, c10::Scalar const&, c10::Scalar const&, at::Tensor&)::{lambda()#1}::operator()() const::{lambda()#3}::operator() const (libtorch_cuda.so)
    at::native::arange_cuda_out(c10::Scalar const&, c10::Scalar const&, c10::Scalar const&, at::Tensor&)::{lambda()#1}::operator() const (libtorch_cuda.so)
    at::native::arange_cuda_out (libtorch_cuda.so)
    at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_start_out_arange_out (libtorch_cuda.so)
    at::_ops::arange_start_out::call (libtorch_cpu.so)
    at::native::arange (libtorch_cpu.so)
    at::native::arange (libtorch_cpu.so)
    at::native::arange (libtorch_cpu.so)
    c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__arange(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)>, at::Tensor, c10::guts::typelist::typelist<c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool> > >, at::Tensor(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)>::call (libtorch_cpu.so)
    at::_ops::arange::redispatch (libtorch_cpu.so)
    c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>), &at::(anonymous namespace)::arange(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)>, at::Tensor, c10::guts::typelist::typelist<c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool> > >, at::Tensor(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)>::call (libtorch_cpu.so)
    at::_ops::arange::call (libtorch_cpu.so)
    torch::autograd::THPVariable_arange (libtorch_python.so)
    __init__ (vllm/model_executor/layers/attention.py:56)
...

@simon-mo
Collaborator

I think downgrading or rolling forward (if a new version is released) is the safest option, unfortunately.

@Tostino
Contributor Author

Tostino commented Nov 30, 2023

Well, it looks like Pop!_OS doesn't support downgrading drivers, and there is no way for me to go back without a reinstall...

Guess I'm out of the game for a couple of months until a driver update appears... I don't have the time to deal with a reinstall.

@simon-mo
Collaborator

Can folks help me with one extra piece of information for debugging this: what's your NCCL version?

python -c "import torch;print(torch.cuda.nccl.version())"

@Tostino
Contributor Author

Tostino commented Nov 30, 2023

(venv) user@pop-os:/media/user/Data/IdeaProjects/vllm$ python -c "import torch;print(torch.cuda.nccl.version())"
(2, 18, 1)

@mirkogolze

For me it's (2, 14, 3).

@mirkogolze

mirkogolze commented Dec 8, 2023

We updated our server with the two A100 40GB GPUs to the latest Ubuntu + latest NVIDIA driver + latest CUDA, and now it works as expected. So it seems it really is a driver problem.

Ubuntu 22.04.3 LTS
NVIDIA-SMI 545.29.06              
Driver Version: 545.29.06    
CUDA Version: 12.3

But it also worked before the driver update with P2P disabled (NCCL_P2P_DISABLE=1, see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-p2p-disable).
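
For anyone who wants to try that workaround, the variable just needs to be set in the environment of the server process, e.g. (adapting the commands from earlier in this thread):

NCCL_P2P_DISABLE=1 python3 -m vllm.entrypoints.openai.api_server \
        --model teknium/OpenHermes-2.5-Mistral-7B --tensor-parallel-size 2

For the Docker setup above, add --env "NCCL_P2P_DISABLE=1" to the docker run command.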

@wookayin

@mirkogolze For the purpose of archival and context for future readers, can you write down the NVIDIA driver and CUDA versions with which you got it working? Thanks.
