We are seeing the latest version of vLLM getting stuck at random after a few minutes of work, sometimes after an hour.
The server still receives new requests and can reply to /health and /metrics, but no tokens are generated and no requests complete.
The server keeps printing its status every 5 seconds, but no tokens are produced, as if the engine loop is stuck.
INFO 02-01 06:36:05 llm_engine.py:921] Avg prompt throughput: 382.6 tokens/s, Avg generation throughput: 118.5 tokens/s, Max iteration time: 386.7 ms, Avg time/tok:149.4 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 115 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO 02-01 06:36:05 async_llm_engine.py:110] Finished request cmpl-50c32d7a66084c3f9980d2bf06d79900-0.
INFO 02-01 06:36:05 async_llm_engine.py:110] Finished request cmpl-90590e17ce6b4fa4b19f0812c0c98446-0.
INFO: 10.244.5.235:41834 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: 10.244.5.237:53262 - "POST /v1/completions HTTP/1.1" 200 OK
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-a523b4f84b1b491d9f61ddc4558f532b-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-25d5d0f7555c46f588570cc83d3a0f81-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO: 10.244.6.107:43538 - "GET /metrics HTTP/1.1" 200 OK
INFO 02-01 06:36:05 async_llm_engine.py:436] Received request cmpl-cf7f4d9b34144b3f8efc55498f75c782-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:07 async_llm_engine.py:436] Received request cmpl-fe5c335b41654d2b9e1141819f92e762-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:08 async_llm_engine.py:436] Received request cmpl-07d703ba4be6400d95704ae748e9c752-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:10 async_llm_engine.py:436] Received request cmpl-61ddd4fa2c074d108048d88e884f5bef-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:10 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7.2 tokens/s, Max iteration time: 107.8 ms, Avg time/tok:107.8 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO: 10.244.39.1:46396 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:11 async_llm_engine.py:436] Received request cmpl-37d1f11755354de88177e21d466f9ae4-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:12 async_llm_engine.py:436] Received request cmpl-bfd5818ad84548fdb8fbd3ed075d8a00-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:12 async_llm_engine.py:436] Received request cmpl-cd834640585140fabf9f9f5342d08617-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.5, repetition_penalty=1.0, temperature=0.1, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['USER:', 'ASSISTANT:', 'Reference(s):', 'Note:'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=250, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:13 async_llm_engine.py:436] Received request cmpl-d0a32735719f4425ac7bcc47d73e4c6a-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO: 10.244.41.46:34592 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:15 async_llm_engine.py:436] Received request cmpl-28df8dd0e22140d09d2eb497eabb2ae6-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:16 async_llm_engine.py:436] Received request cmpl-0891eb16b2734aa0be69ac560ce76262-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-bfe0ccd536e84b858fa4a6455cc3c84e-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-fa92185c03d64100a87b055f8de9ebec-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:18 async_llm_engine.py:436] Received request cmpl-75fbe42838b44889b2365a8f896769d9-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:20 async_llm_engine.py:436] Received request cmpl-9a7be2f4394747bc86b924fff8729e53-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO: 10.244.6.107:41448 - "GET /metrics HTTP/1.1" 200 OK
INFO 02-01 06:36:20 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Max iteration time: 0.0 ms, Avg time/tok:0.0 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO: 10.244.39.1:50926 - "GET /health HTTP/1.1" 200 OK
INFO: 10.244.41.46:51998 - "GET /health HTTP/1.1" 200 OK
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-2ad4077818664575a61e96651cc8ff02-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-ee0f799a0def4b3abda0e6e3b782fc9a-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:28 async_llm_engine.py:436] Received request cmpl-452bd3a0cb2143feb93ff7c253350954-0: prompt: None, prefix_pos: None,sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=512, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: [], lora_request: None.
INFO 02-01 06:36:30 llm_engine.py:921] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Max iteration time: 0.0 ms, Avg time/tok:0.0 ms, Running: 35 reqs, Swapped: 0 reqs, Pending: 116 reqs, GPU KV cache usage: 99.0%, CPU KV cache usage: 0.0%
INFO: 10.244.39.1:46878 - "GET /health HTTP/1.1" 200 OK
INFO: 10.244.41.46:54412 - "GET /health HTTP/1.1" 200 OK
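Note that /health keeps returning 200 OK while the engine is hung, so a plain health probe never catches this state. A minimal watchdog sketch that instead watches the throughput metrics the server keeps exporting is below; the metric names (vllm:avg_generation_throughput_toks_per_s, vllm:num_requests_running), port, and thresholds are assumptions for illustration and may differ between vLLM versions, so adjust them to what your /metrics output actually exposes.

import re
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # assumption: default server address/port
CHECK_INTERVAL_S = 30
STUCK_CHECKS = 4  # roughly two minutes of zero throughput with work still queued


def scrape_metric(text: str, name: str) -> float | None:
    """Return the first sample value for a Prometheus metric, if present."""
    match = re.search(
        rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([0-9.eE+-]+)$", text, re.M
    )
    return float(match.group(1)) if match else None


def main() -> None:
    zero_streak = 0
    while True:
        body = urllib.request.urlopen(METRICS_URL, timeout=10).read().decode()
        gen_tps = scrape_metric(body, "vllm:avg_generation_throughput_toks_per_s")
        running = scrape_metric(body, "vllm:num_requests_running")

        # The hang signature from the logs above: requests are still running
        # (and pending), but generation throughput has dropped to 0 and stays there.
        if gen_tps == 0.0 and (running or 0) > 0:
            zero_streak += 1
        else:
            zero_streak = 0

        if zero_streak >= STUCK_CHECKS:
            print("engine appears stuck: alert or restart the server here")
            zero_streak = 0

        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    main()

This is only a detection workaround, not a fix; it simply turns the "Avg generation throughput: 0.0 tokens/s with Running/Pending requests" pattern visible in the logs into a signal that can trigger an alert or restart.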