It isn't clear whether vLLM or Ray is at fault here.
There is a thread on the Ray forums that outlines the issue; it is 16 days old and has received no reply:
https://discuss.ray.io/t/running-vllm-script-on-multi-node-cluster/13533
The following output is taken from that thread, but it is identical to what I see:
2024-01-24 13:57:17,308 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: HOST_IP_ADDRESS…
2024-01-24 13:57:17,317 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
INFO 01-24 13:57:39 llm_engine.py:70] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
After that it hangs, and eventually quits.
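For context, the script involved is essentially the standard multi-node setup; a minimal sketch of it looks like the following (the Ray address and prompt are placeholders, not my exact values):

```python
import ray
from vllm import LLM, SamplingParams

# Connect to the already-running multi-node Ray cluster
# (address is a placeholder; in practice it points at the head node).
ray.init(address="auto")

# Engine initialization with tensor_parallel_size=2 is where the hang occurs.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=2,
)

# Never reached: the process hangs during initialization above.
outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```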
I have exactly the same problem. As the thread details, "ray status" shows the nodes up and communicating, yet the engine stays in this state for a long time and then eventually crashes with some error messages. Everything in that thread matches what is happening for me.
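For reference, a quick way to confirm from Python that both nodes are visible (independent of the "ray status" CLI) is something like this sketch (address again a placeholder):

```python
import ray

ray.init(address="auto")  # placeholder; point at the head node

# If the cluster is healthy, resources from both nodes show up here.
print(ray.cluster_resources())

# Per-node details, including whether each node is alive.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"])
```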
Unfortunately, the Ray forums probably won't want to engage because it involves vLLM, and I am concerned that vLLM won't want to engage because it involves Ray...