It isn't clear whether vLLM or Ray is at fault here.
There is a thread on the Ray forums that outlines the issue; it is 16 days old and has received no reply:
https://discuss.ray.io/t/running-vllm-script-on-multi-node-cluster/13533
The following output is taken from that thread, but it is identical to what I see:
2024-01-24 13:57:17,308 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: HOST_IP_ADDRESS…
2024-01-24 13:57:17,317 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
INFO 01-24 13:57:39 llm_engine.py:70] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
After that it hangs, and eventually quits.
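For context, the script involved is essentially the standard multi-node setup; a minimal sketch of it looks like the following (the Ray address and prompt are placeholders, not my exact values):

```python
import ray
from vllm import LLM, SamplingParams

# Connect to the already-running multi-node Ray cluster
# (address is a placeholder; in practice it points at the head node).
ray.init(address="auto")

# Engine initialization with tensor_parallel_size=2 is where the hang occurs.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=2,
)

# Never reached: the process hangs during initialization above.
outputs = llm.generate(
    ["Hello, how are you?"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```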
I have exactly the same problem. As the thread details, "ray status" shows the nodes up and communicating, yet the engine stays in this state for a long time and then eventually crashes with some error messages. Everything in that thread matches what is happening for me.
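For reference, a quick way to confirm from Python that both nodes are visible (independent of the "ray status" CLI) is something like this sketch (address again a placeholder):

```python
import ray

ray.init(address="auto")  # placeholder; point at the head node

# If the cluster is healthy, resources from both nodes show up here.
print(ray.cluster_resources())

# Per-node details, including whether each node is alive.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"])
```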
Unfortunately, the Ray forums probably won't want to engage because it involves vLLM, and I am concerned that vLLM won't want to engage because it involves Ray...