
vLLM running on a Ray Cluster Hanging on Initializing #2826

@Kaotic3

Description


It isn't clear whether vLLM or Ray is at fault here.

There is a thread on the Ray forums that outlines the issue; it is 16 days old and has no replies.

https://discuss.ray.io/t/running-vllm-script-on-multi-node-cluster/13533

The following log is taken from that thread, but it is identical to what I see.

2024-01-24 13:57:17,308 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: HOST_IP_ADDRESS...
2024-01-24 13:57:17,317 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
INFO 01-24 13:57:39 llm_engine.py:70] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)

But after that it hangs, and eventually quits.

I have exactly the same problem. The thread details the other points: "ray status" shows the nodes up and communicating, and the engine stays stuck at this point for a long time before eventually crashing with some error messages. Everything in that thread is identical to what is happening for me.
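For reference, a minimal sketch of the setup described above. The model name and tensor_parallel_size are taken from the log; the Ray address and the prompt are assumptions for illustration, not part of the original report. This is the shape of the script that hangs, not a verified fix.

```python
# Minimal repro sketch (assumes a running multi-node Ray cluster with 2 GPUs).
import ray
from vllm import LLM, SamplingParams

# Attach to the existing Ray cluster rather than starting a local one.
# "auto" is a placeholder; the original report used an explicit head-node IP.
ray.init(address="auto")

# tensor_parallel_size=2 makes vLLM request two GPU workers from Ray.
# Per the log above, initialization reaches the "Initializing an LLM engine"
# line and then hangs inside this constructor.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=2,
)

# Never reached when the hang occurs.
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```

Running `ray status` on the head node while the script is stuck shows the nodes and GPUs as available, which is what makes it unclear whether the fault lies with vLLM's worker placement or with the Ray cluster itself.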

Unfortunately, the Ray forums probably don't want to engage because it involves vLLM, and I am concerned that vLLM won't want to engage because it involves Ray.
