vLLM getting stuck. Nothing is generated while requests are running and pending. #2731

@NikolaBorisov

Description

We are seeing the latest version of vLLM get stuck randomly after a few minutes of work, sometimes after an hour.

The server still receives new requests and can reply to /health and /metrics, but no tokens are generated and no requests complete.
The server keeps printing its status every 5 seconds, yet no tokens are produced, as if the engine loop is stuck.
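
Because /health keeps returning 200 OK during the hang, a health-based probe does not catch this. A minimal sketch of the end-to-end liveness check we use to detect the stall instead: it sends a tiny real completion through the OpenAI-compatible endpoint and treats a timeout as "engine loop stuck". The URL, port, and model name below are placeholders for our deployment, not anything from vLLM itself.

```python
# Hypothetical liveness probe: send a 1-token completion and flag a hang
# if it never finishes. URL and model name are assumptions -- adjust them.
import requests

VLLM_URL = "http://localhost:8000/v1/completions"  # assumed default port
MODEL = "my-model"                                  # placeholder model name

def engine_is_generating(timeout_s: float = 30.0) -> bool:
    payload = {
        "model": MODEL,
        "prompt": "ping",
        "max_tokens": 1,
        "temperature": 0.0,
    }
    try:
        resp = requests.post(VLLM_URL, json=payload, timeout=timeout_s)
        resp.raise_for_status()
        return True   # a token came back, so the engine loop is alive
    except requests.exceptions.Timeout:
        return False  # request accepted but never completed: matches the hang
    except requests.exceptions.RequestException:
        return False  # treat transport errors as unhealthy as well

if __name__ == "__main__":
    print("engine generating:", engine_is_generating())
```

When the server is in the stuck state described above, this probe times out even though /health and /metrics keep answering, which is how we distinguish the hang from normal load.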

INFO 02-01 06:36:05 llm_engine.py:921] Avg prompt throughput: 382.6 tokens/s, Avg generation throughput: 118.5 tokens/s, Max iteration time: 386.7 ms, Avg time/tok:149.4 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 115 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO 02-01 06:36:05 async_llm_engine.py:110] Finished request cmpl-50c32d7a66084c3f9980d2bf06d79900-0.
INFO 02-01 06:36:05 async_llm_engine.py:110] Finished request cmpl-90590e17ce6b4fa4b19f0812c0c98446-0.
INFO:     10.244.5.235:41834 - "POST /v1/completions HTTP/1.1" 200 OK
INFO:     10.244.5.237:53262 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-a523b4f84b1b491d9f61ddc4558f532b-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-25d5d0f7555c46f588570cc83d3a0f81-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO:     10.244.6.107:43538 - "GET /metrics HTTP/1.1" 200 OK
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-cf7f4d9b34144b3f8efc55498f75c782-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:07 async_llm_engine.py:436] Received request cmpl-fe5c335b41654d2b9e1141819f92e762-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:08 async_llm_engine.py:436] Received request cmpl-07d703ba4be6400d95704ae748e9c752-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:10 async_llm_engine.py:436] Received request cmpl-61ddd4fa2c074d108048d88e884f5bef-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:10 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.2 tokens/s, Max iteration time: 107.8 ms, Avg time/tok:107.8 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO:     10.244.39.1:46396 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:11 async_llm_engine.py:436] Received request cmpl-37d1f11755354de88177e21d466f9ae4-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:12 async_llm_engine.py:436] Received request cmpl-bfd5818ad84548fdb8fbd3ed075d8a00-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:12 async_llm_engine.py:436] Received request cmpl-cd834640585140fabf9f9f5342d08617-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.5, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['USER:', 'ASSISTANT:', 'Reference(s):', 'Note:'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=250, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:13 async_llm_engine.py:436] Received request cmpl-d0a32735719f4425ac7bcc47d73e4c6a-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO:     10.244.41.46:34592 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:15 async_llm_engine.py:436] Received request cmpl-28df8dd0e22140d09d2eb497eabb2ae6-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:16 async_llm_engine.py:436] Received request cmpl-0891eb16b2734aa0be69ac560ce76262-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-bfe0ccd536e84b858fa4a6455cc3c84e-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-fa92185c03d64100a87b055f8de9ebec-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-75fbe42838b44889b2365a8f896769d9-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:20 async_llm_engine.py:436] Received request cmpl-9a7be2f4394747bc86b924fff8729e53-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO:     10.244.6.107:41448 - "GET /metrics HTTP/1.1" 200 OK
INFO 02-01 06:36:20 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Max iteration time: 0.0 ms, Avg time/tok:0.0 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO:     10.244.39.1:50926 - "GET /health HTTP/1.1" 200 OK
INFO:     10.244.41.46:51998 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-2ad4077818664575a61e96651cc8ff02-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-ee0f799a0def4b3abda0e6e3b782fc9a-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-452bd3a0cb2143feb93ff7c253350954-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:30 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Max iteration time: 0.0 ms, Avg time/tok:0.0 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO:     10.244.39.1:46878 - "GET /health HTTP/1.1" 200 OK
INFO:     10.244.41.46:54412 - "GET /health HTTP/1.1" 200 OK
