
[Bug]: Running GLM-4.1V fails with ValueError: Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders #21516

@leizhu1989

Description


Your current environment

OS: Ubuntu 20.04
Env: conda
CUDA: 12.8
GPUs: 2× Tesla T4
vLLM: 0.9.2
transformers: 4.53.3

🐛 Describe the bug

vllm serve /mnt/vdb/project/glm4.1v-model --tensor-parallel-size=2 --served-model-name=ui-tars
INFO 07-24 17:34:54 [init.py:244] Automatically detected platform cuda.
INFO 07-24 17:34:57 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-24 17:34:57 [cli_args.py:325] non-default args: {'model': '/mnt/vdb/project/glm4.1v-model', 'served_model_name': ['ui-tars'], 'tensor_parallel_size': 2}
INFO 07-24 17:35:04 [config.py:841] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate'}. Defaulting to 'generate'.
WARNING 07-24 17:35:04 [config.py:3320] Your device 'Tesla T4' (with compute capability 7.5) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
WARNING 07-24 17:35:04 [config.py:3371] Casting torch.bfloat16 to torch.float16.
INFO 07-24 17:35:04 [config.py:1472] Using max model len 65536
WARNING 07-24 17:35:04 [arg_utils.py:1735] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
WARNING 07-24 17:35:04 [arg_utils.py:1542] The model has a long context length (65536). This may cause OOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 07-24 17:35:04 [api_server.py:268] Started engine process with PID 831609
INFO 07-24 17:35:08 [init.py:244] Automatically detected platform cuda.
INFO 07-24 17:35:10 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='/mnt/vdb/project/glm4.1v-model', speculative_config=None, tokenizer='/mnt/vdb/project/glm4.1v-model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=65536, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=ui-tars, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True,
WARNING 07-24 17:35:11 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-24 17:35:11 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-24 17:35:11 [cuda.py:360] Using XFormers backend.
INFO 07-24 17:35:15 [init.py:244] Automatically detected platform cuda.
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:17 [multiproc_worker_utils.py:226] Worker ready; awaiting tasks
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:17 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:17 [cuda.py:360] Using XFormers backend.
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:18 [init.py:1152] Found nccl from library libnccl.so.2
INFO 07-24 17:35:18 [init.py:1152] Found nccl from library libnccl.so.2
INFO 07-24 17:35:18 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:18 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:19 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=831706) WARNING 07-24 17:35:19 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 07-24 17:35:19 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 07-24 17:35:19 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 07-24 17:35:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_127f075d'), local_subscribe_addr='ipc:///tmp/f43d87c5-a9b9-4814-aaf2-51aa6645eb37', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-24 17:35:19 [parallel_state.py:1076] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:19 [parallel_state.py:1076] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 07-24 17:35:19 [model_runner.py:1171] Starting to load model /mnt/vdb/project/glm4.1v-model...
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:19 [model_runner.py:1171] Starting to load model /mnt/vdb/project/glm4.1v-model...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:03<00:10, 3.52s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:07<00:07, 3.59s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:12<00:04, 4.58s/it]
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:37 [default_loader.py:272] Loading weights took 17.42 seconds
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:17<00:00, 4.75s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:17<00:00, 4.48s/it]

INFO 07-24 17:35:37 [default_loader.py:272] Loading weights took 18.04 seconds
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:37 [model_runner.py:1203] Model loading took 9.7069 GiB and 17.758631 seconds
INFO 07-24 17:35:38 [model_runner.py:1203] Model loading took 9.7069 GiB and 18.381705 seconds
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(VllmWorkerProcess pid=831706) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] Traceback (most recent call last):
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 233, in _run_worker_process
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/utils/__init__.py", line 2736, in run_method
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return func(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return func(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/worker.py", line 256, in determine_num_available_blocks
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] self.model_runner.profile_run()
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return func(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1300, in profile_run
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1426, in _dummy_run
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return func(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1844, in execute_model
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1555, in forward
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] inputs_embeds = self.get_input_embeddings_v0(
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1500, in get_input_embeddings_v0
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] inputs_embeds = merge_multimodal_embeddings(
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 511, in merge_multimodal_embeddings
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return _merge_multimodal_embeddings(
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 427, in _merge_multimodal_embeddings
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] raise ValueError(
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ValueError: Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders
ERROR 07-24 17:36:48 [engine.py:458] Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders
ERROR 07-24 17:36:48 [engine.py:458] Traceback (most recent call last):
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
ERROR 07-24 17:36:48 [engine.py:458] engine = MQLLMEngine.from_vllm_config(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 07-24 17:36:48 [engine.py:458] return cls(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 07-24 17:36:48 [engine.py:458] self.engine = LLMEngine(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
ERROR 07-24 17:36:48 [engine.py:458] self._initialize_kv_caches()
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches
ERROR 07-24 17:36:48 [engine.py:458] self.model_executor.determine_num_available_blocks())
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 104, in determine_num_available_blocks
ERROR 07-24 17:36:48 [engine.py:458] results = self.collective_rpc("determine_num_available_blocks")
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 332, in collective_rpc
ERROR 07-24 17:36:48 [engine.py:458] return self._run_workers(method, *args, **(kwargs or {}))
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 186, in _run_workers
ERROR 07-24 17:36:48 [engine.py:458] driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/utils/__init__.py", line 2736, in run_method
ERROR 07-24 17:36:48 [engine.py:458] return func(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-24 17:36:48 [engine.py:458] return func(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/worker.py", line 256, in determine_num_available_blocks
ERROR 07-24 17:36:48 [engine.py:458] self.model_runner.profile_run()
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-24 17:36:48 [engine.py:458] return func(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1300, in profile_run
ERROR 07-24 17:36:48 [engine.py:458] self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1426, in _dummy_run
ERROR 07-24 17:36:48 [engine.py:458] self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-24 17:36:48 [engine.py:458] return func(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1844, in execute_model
ERROR 07-24 17:36:48 [engine.py:458] hidden_or_intermediate_states = model_executable(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-24 17:36:48 [engine.py:458] return self._call_impl(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-24 17:36:48 [engine.py:458] return forward_call(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1555, in forward
ERROR 07-24 17:36:48 [engine.py:458] inputs_embeds = self.get_input_embeddings_v0(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1500, in get_input_embeddings_v0
ERROR 07-24 17:36:48 [engine.py:458] inputs_embeds = merge_multimodal_embeddings(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 511, in merge_multimodal_embeddings
ERROR 07-24 17:36:48 [engine.py:458] return _merge_multimodal_embeddings(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 427, in _merge_multimodal_embeddings
ERROR 07-24 17:36:48 [engine.py:458] raise ValueError(
ERROR 07-24 17:36:48 [engine.py:458] ValueError: Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders
INFO 07-24 17:36:48 [multiproc_worker_utils.py:125] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 460, in run_mp_engine
raise e from None
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
return cls(
^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
self._initialize_kv_caches()
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 104, in determine_num_available_blocks
results = self.collective_rpc("determine_num_available_blocks")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 332, in collective_rpc
return self._run_workers(method, *args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 186, in _run_workers
driver_worker_output = run_method(self.driver_worker, sent_method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/utils/__init__.py", line 2736, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/worker.py", line 256, in determine_num_available_blocks
self.model_runner.profile_run()
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1300, in profile_run
self._dummy_run(max_num_batched_tokens, max_num_seqs)
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1426, in _dummy_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1844, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1555, in forward
inputs_embeds = self.get_input_embeddings_v0(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1500, in get_input_embeddings_v0
inputs_embeds = merge_multimodal_embeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 511, in merge_multimodal_embeddings
return _merge_multimodal_embeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 427, in _merge_multimodal_embeddings
raise ValueError(
ValueError: Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders
[rank0]:[W724 17:36:49.096568445 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/mnt/vdb/anaconda3/envs/glm4.1v/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 65, in main
args.dispatch_function(args)
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 55, in cmd
uvloop.run(run_server(args))
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 291, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
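For context on what the error means: the check that raises it sits in `_merge_multimodal_embeddings` (vllm/model_executor/models/utils.py, line 427 in the trace above), which scatters the vision encoder's output rows into the placeholder token positions of the text embedding sequence, and requires the two counts to match. In this log the failure happens during the startup profiling dummy run, where 30000 placeholder tokens were generated but the dummy image input produced only 100 embeddings. Below is a minimal sketch of that invariant; the function name, signature, and single-placeholder-id simplification are illustrative only, not vLLM's actual API:

```python
import torch


def merge_multimodal_embeddings_sketch(
    input_ids: torch.Tensor,      # flattened token ids, shape (seq_len,)
    inputs_embeds: torch.Tensor,  # text embeddings, shape (seq_len, hidden)
    mm_embeds: torch.Tensor,      # multimodal embeddings, shape (num_mm, hidden)
    placeholder_token_id: int,
) -> torch.Tensor:
    """Scatter multimodal embedding rows into placeholder positions.

    Raises ValueError when the number of placeholder tokens in the
    sequence does not match the number of embedding rows -- the same
    mismatch reported in the log above (100 rows vs 30000 placeholders).
    """
    is_mm = input_ids == placeholder_token_id
    num_placeholders = int(is_mm.sum().item())
    if num_placeholders != mm_embeds.shape[0]:
        raise ValueError(
            f"Attempted to assign {mm_embeds.shape[0]} multimodal tokens "
            f"to {num_placeholders} placeholders")
    out = inputs_embeds.clone()
    out[is_mm] = mm_embeds  # boolean-mask row assignment
    return out
```

So the bug is not that the image was processed incorrectly per se, but that the profiling input expanded far more placeholder tokens than the vision tower returned embeddings for, aborting `determine_num_available_blocks` before the server could start.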

