Nvidia drivers 545.29.02 broken --tensor-parallel-size #1801
Comments
Can we start by debugging torch distributed, which is the underlying implementation we use? Try running this example code:
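The linked snippet itself isn't preserved in this thread, so here is a minimal sketch of the kind of two-process torch.distributed all-reduce test that exercises the same NCCL path (assumptions: two local GPUs, the NCCL backend, and a free local port):

```python
# Hypothetical stand-in for the torch.distributed example referenced above.
# It spawns two processes, initializes an NCCL process group, and runs a
# single all_reduce -- enough to surface NCCL-level hangs on a multi-GPU box.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"  # assumed free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)  # hangs here if the driver/NCCL combination is broken
    print(f"rank {rank}: all_reduce result = {x.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```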
vLLM 0.2.1 worked before with CUDA 11.8. I installed the latest CUDA 12.3 today to run 0.2.2+. I tested the Torch tutorial example; it completes in a few seconds without printing anything, which I guess is good. No hang occurs.
To further debug the hanging, it would be great to capture stack traces of the stuck processes. I think if Ray is used, running `ray stack` should show where the workers are stuck.
@simon-mo Here is the `ray stack` output while things are spinning: ray_stack.txt
I think I have the same issue. I tried to start it with Docker; the two RayWorker processes run at 100% CPU, and the process gets killed after about 45 minutes.
The whole console output: consoleOutput.txt
Downgrade to driver version 535; that works. If you installed CUDA from the network repository on Ubuntu 22.04, this should work:
Then reboot.
Related issue at llama.cpp: they had the same problem, which caused broken model output (lots of hash marks): ggml-org/llama.cpp#3772
My hunch would be some sort of weird NCCL + PyTorch + CUDA combination causing deadlocks (cf. NVIDIA/nccl#1013 (comment)). @Tostino's stack trace shows the model workers stuck on kernel launches.
I think downgrading, or rolling forward if a new driver version is released, is the safest option, unfortunately.
Well, it looks like pop_os doesn't support downgrading drivers, and there is no way for me to go back without a reinstall... Guess I'm out of the game for a couple of months until a driver update appears; I don't have the time to deal with a reinstall.
Can folks help me with one extra piece of information for debugging this: what's your NCCL version?
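The inline snippet asked for above isn't preserved; one common way to read the NCCL version bundled with PyTorch, which matches the tuple format in the reply below, is:

```python
import torch

# NCCL version compiled into the installed PyTorch wheel, e.g. (2, 14, 3)
print(torch.cuda.nccl.version())
```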
For me it's (2, 14, 3).
We updated our server with the two A100 40GB GPUs to the latest Ubuntu + latest Nvidia driver + latest CUDA, and now it works as expected. So it really does seem to be a driver problem.
But with NCCL P2P disabled (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-p2p-disable) ...
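For readers who want to try the P2P workaround referenced above, a minimal sketch, assuming the variable is set before vLLM (and its Ray workers) initialize NCCL; whether this actually avoids the 545.29.02 hang is not confirmed in this thread:

```python
import os

# Disable NCCL peer-to-peer transfers (see the NCCL env-var docs linked above).
# This must be set before NCCL is initialized, i.e. before the engine starts;
# alternatively export NCCL_P2P_DISABLE=1 in the shell that launches the server.
os.environ["NCCL_P2P_DISABLE"] = "1"
```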
@mirkogolze For the purpose of archival and context for future readers, can you write down the versions of the Nvidia driver and CUDA with which you got it working? Thanks.
I just upgraded my drivers to 545.29.02, and it has broken the ability to run models larger than a single GPU's RAM for me with vLLM.
If I pass in `--tensor-parallel-size 2`, things just hang when trying to create the engine. Without it, the model loads just fine (if it fits in a single GPU's RAM).
PyTorch version: '2.1.1+cu121'
The model never finishes loading. nvidia-smi shows some load on the GPUs, and I have two CPU cores pegged as well.
