
[Bug]: GLM4.1v-9b-thinking support video #2201

@ttttzh

Description


Your current environment

Video input is not supported for GLM4.1V-thinking.
using vllm-ascend==0.9.2rc2.dev131+ge38fab0.d20250804
vllm==0.10.1.dev147+gb18b417fb.empty

 vllm serve /model --limit-mm-per-prompt '{"image":32}' --allowed-local-media-path / --port 8080  --gpu-memory-utilization 0.85
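Note that the launch command above only declares an image budget in `--limit-mm-per-prompt`. As a hedged variant (the flag accepts a `"video"` key in recent vLLM versions; whether this affects the crash below is unverified, since the warning shows the video was in fact ingested and resampled):

```shell
# Sketch: also declare a video budget in the multimodal limit map.
# This may not resolve the aclnnIndexPutImpl failure reported below.
vllm serve /model \
  --limit-mm-per-prompt '{"image":32,"video":1}' \
  --allowed-local-media-path / \
  --port 8080 \
  --gpu-memory-utilization 0.85
```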



curl --location 'http://127.0.0.1:xxxx/v1/chat/completions' --header 'Content-Type: application/json' --data '{
    "messages": [                                                                                                                                                            {                                                                                                                                                                        "role": "user",                                                               
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "file:///xxx.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "这个视频在呈现什么画面"
                }
            ]
        }
    ],
    "model": "/model",
    "debug": false,
    "stream": false
}'
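The same request payload can be rebuilt programmatically (a sketch mirroring the curl above; the `video_url` content part is a vLLM-specific extension, and the prompt text is translated from the original Chinese):

```python
import json

# Rebuild the chat-completions payload from the curl reproduction above.
payload = {
    "model": "/model",
    "stream": False,
    "messages": [{
        "role": "user",
        "content": [
            # vLLM extension content part for video inputs
            {"type": "video_url", "video_url": {"url": "file:///xxx.mp4"}},
            {"type": "text", "text": "What is this video showing?"},
        ],
    }],
}
body = json.dumps(payload)
# POST `body` to http://127.0.0.1:8080/v1/chat/completions
# with Content-Type: application/json.
```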

WARNING 08-04 15:22:44 [glm4_1v.py:1090] Total frames in metadata (776) does not match the length of video array 32. This can be because the video is resampled in advance. This may cause a divergence with HF implementation.
INFO 08-04 15:22:44 [async_llm.py:273] Added request chatcmpl-97bdd1e7c25b467e85fec6d3c6ea34c2.
[rank0]:[E804 15:23:20.853877511 compiler_depend.ts:429] call aclnnIndexPutImpl failed, detail:E89999: Inner Error!
E89999: [PID: 625] 2025-08-04-15:23:20.869.498 op[Fusion], input shapes [2691] cannot broadcast to shape [5382][FUNC:AddShape][FILE:fusion.cc][LINE:79]
TraceBack (most recent call last):
op[BroadcastTo], add input shapes failed[FUNC:CompletedShapes][FILE:broadcast_v3.cc][LINE:1686]
Autotiling func failed[FUNC:AutoTilingRun][FILE:auto_tiling_rt2.cc][LINE:109]
op[BroadcastTo], call DoTiling failed[FUNC:Tiling4BroadcastTo][FILE:broadcastto.cc][LINE:104]
Tiling failed
Tiling Failed.
Kernel Run failed. opType: 37, BroadcastTo
launch failed for BroadcastTo, errno:561103.

[ERROR] 2025-08-04-15:23:20 (PID:625, Device:0, RankID:-1) ERR01100 OPS call acl api failed
Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:73 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0xb8 (0xffff7f26c908 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x6c (0xffff7f21b404 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: + 0xffd5f8 (0xfffdd6a8d5f8 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: + 0x192b4e0 (0xfffdd73bb4e0 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: + 0x8115e4 (0xfffdd62a15e4 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: + 0x813814 (0xfffdd62a3814 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: + 0x810184 (0xfffdd62a0184 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: + 0x4c9e4c (0xffff7f2a9e4c in /usr/local/python3.10.17/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: + 0x7d5b8 (0xffff89d9d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #9: + 0xe5edc (0xffff89e05edc in /lib/aarch64-linux-gnu/libc.so.6)

ERROR 08-04 15:23:20 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.1.dev147+gb18b417fb) with config: model='/model', speculative_config=None, tokenizer='/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/model, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"/root/.cache/vllm/torch_compile_cache/e536b67d21","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output","vllm.unified_ascend_attention_with_output","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,488,480,464,456,440,432,416,408,392,384,368,360,344,336,328,312,304,288,280,264,256,240,232,216,208,192,184,168,160,152,136,128,112,104,88,80,64,56,40,32,16,8,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":"/root/.cache/vllm/torch_compile_cache/e536b67d21/rank_0_0/backbone"},
ERROR 08-04 15:23:20 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-97bdd1e7c25b467e85fec6d3c6ea34c2,prompt_token_ids_len=5402,mm_inputs=[{'video_grid_thw': tensor([[ 1, 78, 138]]), 'pixel_values_videos': tensor([[-0.0986, -0.1133, -0.1572, ..., -1.2812, -1.4531, -1.4531],
ERROR 08-04 15:23:20 [dump_input.py:76] [-0.3027, -0.2891, -0.4492, ..., -1.3672, -1.2812, -1.4531],
ERROR 08-04 15:23:20 [dump_input.py:76] [-0.1865, -0.1865, -0.2012, ..., -1.3828, -1.4766, -1.4688],
ERROR 08-04 15:23:20 [dump_input.py:76] ...,
ERROR 08-04 15:23:20 [dump_input.py:76] [-0.3184, -0.3184, -0.3184, ..., -1.4766, -1.4766, -1.4766],
ERROR 08-04 15:23:20 [dump_input.py:76] [-0.3770, -0.3613, -0.3184, ..., -1.4766, -1.4766, -1.4766],
ERROR 08-04 15:23:20 [dump_input.py:76] [-0.4062, -0.4062, -0.3613, ..., -1.4531, -1.4531, -1.4531]],
ERROR 08-04 15:23:20 [dump_input.py:76] dtype=torch.bfloat16)}],mm_hashes=['f65be65580c878750e77fbb9f1f14e2b5a402b9d0bed68644e3b1a4828c3a330'],mm_positions=[PlaceholderRange(offset=4, length=5390, is_embed=tensor([False, False, True, ..., False, False, False]))],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151336, 151338, 151348], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=65521, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={chatcmpl-97bdd1e7c25b467e85fec6d3c6ea34c2: 2048}, total_num_scheduled_tokens=2048, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-97bdd1e7c25b467e85fec6d3c6ea34c2: [0]}, num_common_prefix_blocks=[16], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
ERROR 08-04 15:23:20 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, kv_cache_usage=0.0026372944461681147, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=5402, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
ERROR 08-04 15:23:20 [core.py:649] EngineCore encountered a fatal error.
ERROR 08-04 15:23:20 [core.py:649] Traceback (most recent call last):
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 640, in run_engine_core
ERROR 08-04 15:23:20 [core.py:649] engine_core.run_busy_loop()
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 667, in run_busy_loop
ERROR 08-04 15:23:20 [core.py:649] self._process_engine_step()
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 692, in _process_engine_step
ERROR 08-04 15:23:20 [core.py:649] outputs, model_executed = self.step_fn()
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 281, in step
ERROR 08-04 15:23:20 [core.py:649] model_output = self.execute_model_with_error_logging(
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 267, in execute_model_with_error_logging
ERROR 08-04 15:23:20 [core.py:649] raise err
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 258, in execute_model_with_error_logging
ERROR 08-04 15:23:20 [core.py:649] return model_fn(scheduler_output)
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 87, in execute_model
ERROR 08-04 15:23:20 [core.py:649] output = self.collective_rpc("execute_model",
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 08-04 15:23:20 [core.py:649] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/utils/init.py", line 2987, in run_method
ERROR 08-04 15:23:20 [core.py:649] return func(*args, **kwargs)
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 201, in execute_model
ERROR 08-04 15:23:20 [core.py:649] output = self.model_runner.execute_model(scheduler_output,
ERROR 08-04 15:23:20 [core.py:649] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-04 15:23:20 [core.py:649] return func(*args, **kwargs)
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1534, in execute_model
ERROR 08-04 15:23:20 [core.py:649] finished_recving) = (self._process_reqs(scheduler_output,
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1158, in _process_reqs
ERROR 08-04 15:23:20 [core.py:649] mm_embeds = self._gather_mm_embeddings(scheduler_output)
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 959, in _gather_mm_embeddings
ERROR 08-04 15:23:20 [core.py:649] mm_embeds_item = gather_mm_placeholders(
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/worker/utils.py", line 83, in gather_mm_placeholders
ERROR 08-04 15:23:20 [core.py:649] return placeholders[is_embed]
ERROR 08-04 15:23:20 [core.py:649] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnIndexPutImpl.
ERROR 08-04 15:23:20 [core.py:649] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
ERROR 08-04 15:23:20 [core.py:649] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
ERROR 08-04 15:23:20 [core.py:649] [ERROR] 2025-08-04-15:23:20 (PID:625, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
ERROR 08-04 15:23:20 [core.py:649]
ERROR 08-04 15:23:20 [async_llm.py:420] AsyncLLM output_handler failed.
ERROR 08-04 15:23:20 [async_llm.py:420] Traceback (most recent call last):
ERROR 08-04 15:23:20 [async_llm.py:420] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 08-04 15:23:20 [async_llm.py:420] outputs = await engine_core.get_output_async()
ERROR 08-04 15:23:20 [async_llm.py:420] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 805, in get_output_async
ERROR 08-04 15:23:20 [async_llm.py:420] raise self._format_exception(outputs) from None
ERROR 08-04 15:23:20 [async_llm.py:420] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO 08-04 15:23:20 [async_llm.py:346] Request chatcmpl-97bdd1e7c25b467e85fec6d3c6ea34c2 failed (engine dead).
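The numbers in the log fit together as follows (a sketch; the 2x2 spatial merge factor is an assumption based on GLM-4.1V's vision tower, not stated in the log):

```python
# Reconstruct the shapes seen in the error from the dumped inputs.
t, h, w = 1, 78, 138          # video_grid_thw from the scheduler dump
spatial_merge = 2             # assumed 2x2 patch merge

patches = t * h * w                     # raw vision patches
embeds = patches // spatial_merge ** 2  # embeddings after merging
print(patches, embeds)                  # 10764 2691

# The NPU op fails broadcasting shape [2691] to [5382]: the is_embed
# placeholder mask selects exactly twice as many positions as there are
# embeddings, which suggests the placeholder accounting assumes two
# temporal frames while the vision tower emitted one (t=1 above).
mask_selected = 5382                    # from the E89999 message
print(mask_selected // embeds)          # 2
```

So the crash in `gather_mm_placeholders` (`placeholders[is_embed]`) looks like a count mismatch between the placeholder mask and the produced video embeddings, consistent with the earlier warning that the 776-frame video was resampled to 32 frames ahead of time.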

🐛 Describe the bug

(A second run at 15:43, PID 3757, request chatcmpl-6754c1c9ecea4290956e8683c4c9f04a, reproduces the identical failure: aclnnIndexPutImpl fails because op[Fusion] input shapes [2691] cannot broadcast to shape [5382], raised from `placeholders[is_embed]` in gather_mm_placeholders, followed by the same EngineCore fatal error and EngineDeadError. Full trace omitted; it matches the log above line for line apart from timestamps and IDs.)

Labels: bug (Something isn't working)