
Nvidia drivers 545.29.02 broken --tensor-parallel-size #1801


Closed
Tostino opened this issue Nov 27, 2023 · 16 comments

Comments

@Tostino
Contributor

Tostino commented Nov 27, 2023

I just upgraded my drivers to 545.29.02 and it has broken the ability to run models larger than a single GPU's RAM with vLLM.

If I pass --tensor-parallel-size 2, things just hang when trying to create the engine. Without it, the model loads just fine (if it fits in a single GPU's RAM).

(venv) user@pop-os:/media/user/Data/IdeaProjects/vllm$ python3 -m vllm.entrypoints.openai.api_server --model teknium/OpenHermes-2.5-Mistral-7B --tensor-parallel-size 2
INFO 11-27 12:46:10 api_server.py:648] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', model='teknium/OpenHermes-2.5-Mistral-7B', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2023-11-27 12:46:36,779 INFO worker.py:1673 -- Started a local Ray instance.
INFO 11-27 12:46:37 llm_engine.py:72] Initializing an LLM engine with config: model='teknium/OpenHermes-2.5-Mistral-7B', tokenizer='teknium/OpenHermes-2.5-Mistral-7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Pytorch version: '2.1.1+cu121'

(venv) user@pop-os:/media/user/Data/IdeaProjects/vllm$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

The model never finishes loading. nvidia-smi shows some load on the GPUs, and I have two CPU cores pegged as well.
[screenshots: nvidia-smi output and CPU load]

@simon-mo
Collaborator

Can we start by debugging torch.distributed, which is the underlying implementation we use? Try running this example code:
https://pytorch.org/tutorials/intermediate/dist_tuto.html#setup
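
For reference, here is a minimal sketch along those lines (not vLLM's code; the 2-GPU setup and the single all-reduce are assumptions based on the report above). It spawns one process per GPU and runs one NCCL all-reduce:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Fixed rendezvous address; both workers run on the same machine.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # One-element all-reduce across the GPUs.
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: all_reduce result = {t.item()}")  # expect 2.0 with 2 GPUs
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # assumes two visible GPUs
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)

If this prints promptly, torch.distributed and NCCL are working; if it hangs, the problem is below vLLM.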

@viktor-ferenczi
Contributor

vLLM 0.2.1 worked before with CUDA 11.8. I installed the latest CUDA 12.3 today to run a 0.2.2+ build from main. The same problem (hang on model load) happens with cuda-drivers-545 on my dual 4090 system, even for models which worked before. The issue happens only with --tensor-parallel-size=2; it does not happen on a single GPU.

I tested the Torch tutorial example; it completes in a few seconds without printing anything, so I guess that's good. No hang occurs.

@simon-mo
Collaborator

To further debug the hanging, it would be great to use py-spy to check the stacks of the processes and see what they are hanging on.

I think if Ray is used, running ray stack should have the same effect.
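
For example (the PID is a placeholder for one of the hanging RayWorker processes; py-spy may need sudo depending on ptrace settings):

pip install py-spy
py-spy dump --pid <RayWorker PID>

# or, since vLLM launches the workers through Ray:
ray stack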

@Tostino
Contributor Author

Tostino commented Nov 29, 2023

@simon-mo Here is the output of ray stack while things are spinning: ray_stack.txt

@mirkogolze

I think I have the same issue.

I tried to start it with Docker:

docker run --runtime nvidia --gpus all \
        -v /var/gpu-volumes/llmhub:/root/.cache/huggingface \
        -p 8000:8000 \
        --shm-size=10.24gb \
        --env "HUGGING_FACE_HUB_TOKEN=XXXXXXXXXXXXXX" \
        vllm/vllm-openai:latest \
        --model mistralai/Mistral-7B-v0.1 \
        --worker-use-ray \
        --tensor-parallel-size 2

The two RayWorker processes are running at 100% CPU:
[screenshot of process list]

The process gets killed after about 45 minutes. The console shows the following:

(RayWorker pid=4751) [E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800108 milliseconds before timing out.
(RayWorker pid=4751) [E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorker pid=4751) [E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
(RayWorker pid=4751) [E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800108 milliseconds before timing out.
(RayWorker pid=4751) [2023-11-29 14:22:53,323 E 4751 4889] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800108 milliseconds before timing out.
(RayWorker pid=4751) [2023-11-29 14:22:53,334 E 4751 4889] logging.cc:104: Stack trace:
(RayWorker pid=4751)  /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xf199fa) [0x7f59c79c79fa] ray::operator<<()
(RayWorker pid=4751) /usr/local/lib/python3.10/dist-packages/ray/_raylet.so(+0xf1c1b8) [0x7f59c79ca1b8] ray::TerminateHandler()
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c) [0x7f59c693020c]
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277) [0x7f59c6930277]
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae1fe) [0x7f59c69301fe]
(RayWorker pid=4751) /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so(+0xc86dc5) [0x7f5675cbadc5] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f59c695e253]
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f59c868eac3]
(RayWorker pid=4751) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7f59c871fbf4] __clone
(RayWorker pid=4751)
(RayWorker pid=4751) *** SIGABRT received at time=1701267773 on cpu 77 ***
(RayWorker pid=4751) PC: @     0x7f59c86909fc  (unknown)  pthread_kill
(RayWorker pid=4751)     @     0x7f59c863c520  (unknown)  (unknown)
(RayWorker pid=4751) [2023-11-29 14:22:53,334 E 4751 4889] logging.cc:361: *** SIGABRT received at time=1701267773 on cpu 77 ***
(RayWorker pid=4751) [2023-11-29 14:22:53,334 E 4751 4889] logging.cc:361: PC: @     0x7f59c86909fc  (unknown)  pthread_kill
(RayWorker pid=4751) [2023-11-29 14:22:53,334 E 4751 4889] logging.cc:361:     @     0x7f59c863c520  (unknown)  (unknown)
(RayWorker pid=4751) Fatal Python error: Aborted

The whole console output: consoleOutput.txt

@viktor-ferenczi
Contributor

Downgrade to driver version 535; that works.

If you installed CUDA from the network repository on Ubuntu 22.04, then this should work:

apt-get install nvidia-dkms-535 nvidia-utils-535 nvidia-driver-535 cuda-drivers-535

sudo apt-mark hold nvidia-dkms-535
sudo apt-mark hold nvidia-utils-535
sudo apt-mark hold nvidia-driver-535
sudo apt-mark hold cuda-drivers-535

Then reboot.

@viktor-ferenczi
Contributor

viktor-ferenczi commented Nov 29, 2023

Related issue at llama.cpp; they had the same problem, which caused broken model output (lots of hash marks): ggml-org/llama.cpp#3772

@simon-mo
Collaborator

My hunch would be some sort of weird nccl + pytorch + cuda combination causing deadlocks. (cf NVIDIA/nccl#1013 (comment))

@simon-mo
Collaborator

@Tostino's stack trace shows the model workers stuck on a kernel launch:

Stack dump for user      199183 99.1  0.9 41805784 936008 pts/8 Rl+  08:59   0:37 ray::RayWorker.execute_method
Process 199183: ray::RayWorker.execute_method
Python v3.10.12 (/usr/bin/python3.10)

Thread 199183 (active): "MainThread"
    0x7f1d9bb0324f (libcuda.so.545.29.02)
    0x7f1d9b77cb30 (libcuda.so.545.29.02)
    0x7f1d9bb01f3a (libcuda.so.545.29.02)
    0x7f1d9b88cc86 (libcuda.so.545.29.02)
    0x7f1d9b8748de (libcuda.so.545.29.02)
    0x7f1d9b877260 (libcuda.so.545.29.02)
    0x7f1d9b8d8e44 (libcuda.so.545.29.02)
    0x7f2418437c5d (nvidia/cuda_runtime/lib/libcudart.so.12)
    0x7f24184383a0 (nvidia/cuda_runtime/lib/libcudart.so.12)
    0x7f24184383ff (nvidia/cuda_runtime/lib/libcudart.so.12)
    0x7f241843af84 (nvidia/cuda_runtime/lib/libcudart.so.12)
    0x7f2418414930 (nvidia/cuda_runtime/lib/libcudart.so.12)
    cudaLaunchKernel (nvidia/cuda_runtime/lib/libcudart.so.12)
    (anonymous namespace)::gpu_kernel_with_index<__nv_hdl_wrapper_t<false, false, false, __nv_dl_tag<at::Tensor& (*)(c10::Scalar const&, c10::Scalar const&, c10::Scalar const&, at::Tensor&), &at::native::arange_cuda_out(c10::Scalar const&, c10::Scalar const&, c10::Scalar const&, at::Tensor&), (unsigned int)3>, int (long), long, long> > (libtorch_cuda.so)
    at::native::arange_cuda_out(c10::Scalar const&, c10::Scalar const&, c10::Scalar const&, at::Tensor&)::{lambda()#1}::operator()() const::{lambda()#3}::operator() const (libtorch_cuda.so)
    at::native::arange_cuda_out(c10::Scalar const&, c10::Scalar const&, c10::Scalar const&, at::Tensor&)::{lambda()#1}::operator() const (libtorch_cuda.so)
    at::native::arange_cuda_out (libtorch_cuda.so)
    at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_start_out_arange_out (libtorch_cuda.so)
    at::_ops::arange_start_out::call (libtorch_cpu.so)
    at::native::arange (libtorch_cpu.so)
    at::native::arange (libtorch_cpu.so)
    at::native::arange (libtorch_cpu.so)
    c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__arange(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)>, at::Tensor, c10::guts::typelist::typelist<c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool> > >, at::Tensor(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)>::call (libtorch_cpu.so)
    at::_ops::arange::redispatch (libtorch_cpu.so)
    c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>), &at::(anonymous namespace)::arange(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)>, at::Tensor, c10::guts::typelist::typelist<c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool> > >, at::Tensor(c10::Scalar const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)>::call (libtorch_cpu.so)
    at::_ops::arange::call (libtorch_cpu.so)
    torch::autograd::THPVariable_arange (libtorch_python.so)
    __init__ (vllm/model_executor/layers/attention.py:56)
...

@simon-mo
Collaborator

I think downgrading or rolling forward (if a new version is released) is the safest option, unfortunately.

@Tostino
Contributor Author

Tostino commented Nov 30, 2023

Well, it looks like Pop!_OS doesn't support downgrading drivers, and there is no way for me to go back without a reinstall...

Guess I'm out of the game for a couple of months until a driver update appears... I don't have the time to deal with a reinstall.

@simon-mo
Collaborator

Can folks help me with one extra piece of information for debugging this: what's your NCCL version?

python -c "import torch;print(torch.cuda.nccl.version())"

@Tostino
Contributor Author

Tostino commented Nov 30, 2023

(venv) user@pop-os:/media/user/Data/IdeaProjects/vllm$ python -c "import torch;print(torch.cuda.nccl.version())"
(2, 18, 1)

@mirkogolze

For me it's (2, 14, 3).

@mirkogolze

mirkogolze commented Dec 8, 2023

We updated our server with the two A100 40GB GPUs to the latest Ubuntu + latest NVIDIA driver + latest CUDA, and now it works as expected. So it seems it really is a driver problem.

Ubuntu 22.04.3 LTS
NVIDIA-SMI 545.29.06              
Driver Version: 545.29.06    
CUDA Version: 12.3

But it also worked before the driver update with P2P disabled (NCCL_P2P_DISABLE=1, see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-p2p-disable).
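
For anyone who wants to try that workaround, the variable just needs to be set in the environment of the server process, e.g. (adapting the commands from earlier in this thread):

NCCL_P2P_DISABLE=1 python3 -m vllm.entrypoints.openai.api_server \
        --model teknium/OpenHermes-2.5-Mistral-7B --tensor-parallel-size 2

For the Docker setup above, add --env "NCCL_P2P_DISABLE=1" to the docker run command.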

@wookayin

@mirkogolze For the purpose of archival and context for future readers, can you write down the NVIDIA driver and CUDA versions with which you got it working? Thanks.
