
[Bug]: GLM4.1v-9b-thinking support video #2201

@ttttzh

Description


Your current environment

Video input is not supported for GLM4.1V-thinking.
using vllm-ascend==0.9.2rc2.dev131+ge38fab0.d20250804
vllm==0.10.1.dev147+gb18b417fb.empty

 vllm serve /model --limit-mm-per-prompt '{"image":32}' --allowed-local-media-path / --port 8080  --gpu-memory-utilization 0.85
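Note that the launch command above only declares an image budget in `--limit-mm-per-prompt`. As a hedged variant (the flag accepts a `"video"` key in recent vLLM versions; whether this affects the crash below is unverified, since the warning shows the video was in fact ingested and resampled):

```shell
# Sketch: also declare a video budget in the multimodal limit map.
# This may not resolve the aclnnIndexPutImpl failure reported below.
vllm serve /model \
  --limit-mm-per-prompt '{"image":32,"video":1}' \
  --allowed-local-media-path / \
  --port 8080 \
  --gpu-memory-utilization 0.85
```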



curl --location 'http://127.0.0.1:xxxx/v1/chat/completions' --header 'Content-Type: application/json' --data '{
    "messages": [                                                                                                                                                            {                                                                                                                                                                        "role": "user",                                                               
            "content": [
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "file:///xxx.mp4"
                    }
                },
                {
                    "type": "text",
                    "text": "这个视频在呈现什么画面"
                }
            ]
        }
    ],
    "model": "/model",
    "debug": false,
    "stream": false
}'
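The same request payload can be rebuilt programmatically (a sketch mirroring the curl above; the `video_url` content part is a vLLM-specific extension, and the prompt text is translated from the original Chinese):

```python
import json

# Rebuild the chat-completions payload from the curl reproduction above.
payload = {
    "model": "/model",
    "stream": False,
    "messages": [{
        "role": "user",
        "content": [
            # vLLM extension content part for video inputs
            {"type": "video_url", "video_url": {"url": "file:///xxx.mp4"}},
            {"type": "text", "text": "What is this video showing?"},
        ],
    }],
}
body = json.dumps(payload)
# POST `body` to http://127.0.0.1:8080/v1/chat/completions
# with Content-Type: application/json.
```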

WARNING 08-04 15:22:44 [glm4_1v.py:1090] Total frames in metadata (776) does not match the length of video array 32. This can be because the video is resampled in advance. This may cause a divergence with HF implementation.
INFO 08-04 15:22:44 [async_llm.py:273] Added request chatcmpl-97bdd1e7c25b467e85fec6d3c6ea34c2.
[rank0]:[E804 15:23:20.853877511 compiler_depend.ts:429] call aclnnIndexPutImpl failed, detail:E89999: Inner Error!
E89999: [PID: 625] 2025-08-04-15:23:20.869.498 op[Fusion], input shapes [2691] cannot broadcast to shape [5382][FUNC:AddShape][FILE:fusion.cc][LINE:79]
TraceBack (most recent call last):
op[BroadcastTo], add input shapes failed[FUNC:CompletedShapes][FILE:broadcast_v3.cc][LINE:1686]
Autotiling func failed[FUNC:AutoTilingRun][FILE:auto_tiling_rt2.cc][LINE:109]
op[BroadcastTo], call DoTiling failed[FUNC:Tiling4BroadcastTo][FILE:broadcastto.cc][LINE:104]
Tiling failed
Tiling Failed.
Kernel Run failed. opType: 37, BroadcastTo
launch failed for BroadcastTo, errno:561103.

[ERROR] 2025-08-04-15:23:20 (PID:625, Device:0, RankID:-1) ERR01100 OPS call acl api failed
Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:73 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0xb8 (0xffff7f26c908 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x6c (0xffff7f21b404 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: + 0xffd5f8 (0xfffdd6a8d5f8 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #3: + 0x192b4e0 (0xfffdd73bb4e0 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #4: + 0x8115e4 (0xfffdd62a15e4 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #5: + 0x813814 (0xfffdd62a3814 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #6: + 0x810184 (0xfffdd62a0184 in /usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/lib/libtorch_npu.so)
frame #7: + 0x4c9e4c (0xffff7f2a9e4c in /usr/local/python3.10.17/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: + 0x7d5b8 (0xffff89d9d5b8 in /lib/aarch64-linux-gnu/libc.so.6)
frame #9: + 0xe5edc (0xffff89e05edc in /lib/aarch64-linux-gnu/libc.so.6)

ERROR 08-04 15:23:20 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.1.dev147+gb18b417fb) with config: model='/model', speculative_config=None, tokenizer='/model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/model, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"/root/.cache/vllm/torch_compile_cache/e536b67d21","backend":"","custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.unified_ascend_attention_with_output","vllm.unified_ascend_attention_with_output","vllm.unified_ascend_attention_with_output"],"use_inductor":false,"compile_sizes":[],"inductor_compile_config":{},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,488,480,464,456,440,432,416,408,392,384,368,360,344,336,328,312,304,288,280,264,256,240,232,216,208,192,184,168,160,152,136,128,112,104,88,80,64,56,40,32,16,8,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":"/root/.cache/vllm/torch_compile_cache/e536b67d21/rank_0_0/backbone"},
ERROR 08-04 15:23:20 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-97bdd1e7c25b467e85fec6d3c6ea34c2,prompt_token_ids_len=5402,mm_inputs=[{'video_grid_thw': tensor([[ 1, 78, 138]]), 'pixel_values_videos': tensor([[-0.0986, -0.1133, -0.1572, ..., -1.2812, -1.4531, -1.4531],
ERROR 08-04 15:23:20 [dump_input.py:76] [-0.3027, -0.2891, -0.4492, ..., -1.3672, -1.2812, -1.4531],
ERROR 08-04 15:23:20 [dump_input.py:76] [-0.1865, -0.1865, -0.2012, ..., -1.3828, -1.4766, -1.4688],
ERROR 08-04 15:23:20 [dump_input.py:76] ...,
ERROR 08-04 15:23:20 [dump_input.py:76] [-0.3184, -0.3184, -0.3184, ..., -1.4766, -1.4766, -1.4766],
ERROR 08-04 15:23:20 [dump_input.py:76] [-0.3770, -0.3613, -0.3184, ..., -1.4766, -1.4766, -1.4766],
ERROR 08-04 15:23:20 [dump_input.py:76] [-0.4062, -0.4062, -0.3613, ..., -1.4531, -1.4531, -1.4531]],
ERROR 08-04 15:23:20 [dump_input.py:76] dtype=torch.bfloat16)}],mm_hashes=['f65be65580c878750e77fbb9f1f14e2b5a402b9d0bed68644e3b1a4828c3a330'],mm_positions=[PlaceholderRange(offset=4, length=5390, is_embed=tensor([False, False, True, ..., False, False, False]))],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[151336, 151338, 151348], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=65521, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16],),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={chatcmpl-97bdd1e7c25b467e85fec6d3c6ea34c2: 2048}, total_num_scheduled_tokens=2048, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-97bdd1e7c25b467e85fec6d3c6ea34c2: [0]}, num_common_prefix_blocks=[16], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
ERROR 08-04 15:23:20 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, kv_cache_usage=0.0026372944461681147, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=5402, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
ERROR 08-04 15:23:20 [core.py:649] EngineCore encountered a fatal error.
ERROR 08-04 15:23:20 [core.py:649] Traceback (most recent call last):
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 640, in run_engine_core
ERROR 08-04 15:23:20 [core.py:649] engine_core.run_busy_loop()
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 667, in run_busy_loop
ERROR 08-04 15:23:20 [core.py:649] self._process_engine_step()
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 692, in _process_engine_step
ERROR 08-04 15:23:20 [core.py:649] outputs, model_executed = self.step_fn()
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 281, in step
ERROR 08-04 15:23:20 [core.py:649] model_output = self.execute_model_with_error_logging(
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 267, in execute_model_with_error_logging
ERROR 08-04 15:23:20 [core.py:649] raise err
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/engine/core.py", line 258, in execute_model_with_error_logging
ERROR 08-04 15:23:20 [core.py:649] return model_fn(scheduler_output)
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/executor/abstract.py", line 87, in execute_model
ERROR 08-04 15:23:20 [core.py:649] output = self.collective_rpc("execute_model",
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 08-04 15:23:20 [core.py:649] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/utils/init.py", line 2987, in run_method
ERROR 08-04 15:23:20 [core.py:649] return func(*args, **kwargs)
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 201, in execute_model
ERROR 08-04 15:23:20 [core.py:649] output = self.model_runner.execute_model(scheduler_output,
ERROR 08-04 15:23:20 [core.py:649] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 08-04 15:23:20 [core.py:649] return func(*args, **kwargs)
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1534, in execute_model
ERROR 08-04 15:23:20 [core.py:649] finished_recving) = (self._process_reqs(scheduler_output,
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 1158, in _process_reqs
ERROR 08-04 15:23:20 [core.py:649] mm_embeds = self._gather_mm_embeddings(scheduler_output)
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 959, in _gather_mm_embeddings
ERROR 08-04 15:23:20 [core.py:649] mm_embeds_item = gather_mm_placeholders(
ERROR 08-04 15:23:20 [core.py:649] File "/vllm-workspace/vllm/vllm/v1/worker/utils.py", line 83, in gather_mm_placeholders
ERROR 08-04 15:23:20 [core.py:649] return placeholders[is_embed]
ERROR 08-04 15:23:20 [core.py:649] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnIndexPutImpl.
ERROR 08-04 15:23:20 [core.py:649] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
ERROR 08-04 15:23:20 [core.py:649] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
ERROR 08-04 15:23:20 [core.py:649] [ERROR] 2025-08-04-15:23:20 (PID:625, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
ERROR 08-04 15:23:20 [core.py:649]
ERROR 08-04 15:23:20 [async_llm.py:420] AsyncLLM output_handler failed.
ERROR 08-04 15:23:20 [async_llm.py:420] Traceback (most recent call last):
ERROR 08-04 15:23:20 [async_llm.py:420] File "/vllm-workspace/vllm/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 08-04 15:23:20 [async_llm.py:420] outputs = await engine_core.get_output_async()
ERROR 08-04 15:23:20 [async_llm.py:420] File "/vllm-workspace/vllm/vllm/v1/engine/core_client.py", line 805, in get_output_async
ERROR 08-04 15:23:20 [async_llm.py:420] raise self._format_exception(outputs) from None
ERROR 08-04 15:23:20 [async_llm.py:420] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO 08-04 15:23:20 [async_llm.py:346] Request chatcmpl-97bdd1e7c25b467e85fec6d3c6ea34c2 failed (engine dead).
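The numbers in the log fit together as follows (a sketch; the 2x2 spatial merge factor is an assumption based on GLM-4.1V's vision tower, not stated in the log):

```python
# Reconstruct the shapes seen in the error from the dumped inputs.
t, h, w = 1, 78, 138          # video_grid_thw from the scheduler dump
spatial_merge = 2             # assumed 2x2 patch merge

patches = t * h * w                     # raw vision patches
embeds = patches // spatial_merge ** 2  # embeddings after merging
print(patches, embeds)                  # 10764 2691

# The NPU op fails broadcasting shape [2691] to [5382]: the is_embed
# placeholder mask selects exactly twice as many positions as there are
# embeddings, which suggests the placeholder accounting assumes two
# temporal frames while the vision tower emitted one (t=1 above).
mask_selected = 5382                    # from the E89999 message
print(mask_selected // embeds)          # 2
```

So the crash in `gather_mm_placeholders` (`placeholders[is_embed]`) looks like a count mismatch between the placeholder mask and the produced video embeddings, consistent with the earlier warning that the 776-frame video was resampled to 32 frames ahead of time.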

🐛 Describe the bug

(A second run at 15:43, PID 3757, request chatcmpl-6754c1c9ecea4290956e8683c4c9f04a, reproduces the identical failure: aclnnIndexPutImpl fails because op[Fusion] input shapes [2691] cannot broadcast to shape [5382], raised from `placeholders[is_embed]` in gather_mm_placeholders, followed by the same EngineCore fatal error and EngineDeadError. Full trace omitted; it matches the log above line for line apart from timestamps and IDs.)

Labels: bug (Something isn't working)