Your current environment
OS: Ubuntu 20.04
env: conda
CUDA: 12.8, GPUs: 2x Tesla T4
vllm: 0.9.2
transformers: 4.53.3
🐛 Describe the bug
vllm serve /mnt/vdb/project/glm4.1v-model --tensor-parallel-size=2 --served-model-name=ui-tars
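The same failure can presumably be reproduced without the API server, since building the engine offline runs the same memory-profiling step that crashes below. A minimal repro sketch (assumption: the offline LLM entry point hits the same profile_run path; model path as in the command above):

from vllm import LLM

# Engine construction runs determine_num_available_blocks / profile_run,
# which is where the ValueError in the log below is raised.
llm = LLM(
    model="/mnt/vdb/project/glm4.1v-model",
    tensor_parallel_size=2,
)

The server log from the vllm serve command above follows.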
INFO 07-24 17:34:54 [__init__.py:244] Automatically detected platform cuda.
INFO 07-24 17:34:57 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-24 17:34:57 [cli_args.py:325] non-default args: {'model': '/mnt/vdb/project/glm4.1v-model', 'served_model_name': ['ui-tars'], 'tensor_parallel_size': 2}
INFO 07-24 17:35:04 [config.py:841] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate'}. Defaulting to 'generate'.
WARNING 07-24 17:35:04 [config.py:3320] Your device 'Tesla T4' (with compute capability 7.5) doesn't support torch.bfloat16. Falling back to torch.float16 for compatibility.
WARNING 07-24 17:35:04 [config.py:3371] Casting torch.bfloat16 to torch.float16.
INFO 07-24 17:35:04 [config.py:1472] Using max model len 65536
WARNING 07-24 17:35:04 [arg_utils.py:1735] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
WARNING 07-24 17:35:04 [arg_utils.py:1542] The model has a long context length (65536). This may cause OOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 07-24 17:35:04 [api_server.py:268] Started engine process with PID 831609
INFO 07-24 17:35:08 [__init__.py:244] Automatically detected platform cuda.
INFO 07-24 17:35:10 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.2) with config: model='/mnt/vdb/project/glm4.1v-model', speculative_config=None, tokenizer='/mnt/vdb/project/glm4.1v-model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=65536, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=ui-tars, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":256,"local_cache_dir":null}, use_cached_outputs=True,
WARNING 07-24 17:35:11 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-24 17:35:11 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-24 17:35:11 [cuda.py:360] Using XFormers backend.
INFO 07-24 17:35:15 [__init__.py:244] Automatically detected platform cuda.
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:17 [multiproc_worker_utils.py:226] Worker ready; awaiting tasks
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:17 [cuda.py:311] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:17 [cuda.py:360] Using XFormers backend.
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:18 [__init__.py:1152] Found nccl from library libnccl.so.2
INFO 07-24 17:35:18 [__init__.py:1152] Found nccl from library libnccl.so.2
INFO 07-24 17:35:18 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:18 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:19 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=831706) WARNING 07-24 17:35:19 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 07-24 17:35:19 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
WARNING 07-24 17:35:19 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 07-24 17:35:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_127f075d'), local_subscribe_addr='ipc:///tmp/f43d87c5-a9b9-4814-aaf2-51aa6645eb37', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-24 17:35:19 [parallel_state.py:1076] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:19 [parallel_state.py:1076] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 07-24 17:35:19 [model_runner.py:1171] Starting to load model /mnt/vdb/project/glm4.1v-model...
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:19 [model_runner.py:1171] Starting to load model /mnt/vdb/project/glm4.1v-model...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:03<00:10, 3.52s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:07<00:07, 3.59s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:12<00:04, 4.58s/it]
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:37 [default_loader.py:272] Loading weights took 17.42 seconds
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:17<00:00, 4.75s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:17<00:00, 4.48s/it]
INFO 07-24 17:35:37 [default_loader.py:272] Loading weights took 18.04 seconds
(VllmWorkerProcess pid=831706) INFO 07-24 17:35:37 [model_runner.py:1203] Model loading took 9.7069 GiB and 17.758631 seconds
INFO 07-24 17:35:38 [model_runner.py:1203] Model loading took 9.7069 GiB and 18.381705 seconds
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(VllmWorkerProcess pid=831706) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] Traceback (most recent call last):
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 233, in _run_worker_process
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/utils/__init__.py", line 2736, in run_method
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return func(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return func(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/worker.py", line 256, in determine_num_available_blocks
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] self.model_runner.profile_run()
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return func(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1300, in profile_run
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1426, in _dummy_run
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return func(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1844, in execute_model
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1555, in forward
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] inputs_embeds = self.get_input_embeddings_v0(
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1500, in get_input_embeddings_v0
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] inputs_embeds = merge_multimodal_embeddings(
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 511, in merge_multimodal_embeddings
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] return _merge_multimodal_embeddings(
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 427, in _merge_multimodal_embeddings
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] raise ValueError(
(VllmWorkerProcess pid=831706) ERROR 07-24 17:36:48 [multiproc_worker_utils.py:239] ValueError: Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders
ERROR 07-24 17:36:48 [engine.py:458] Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders
ERROR 07-24 17:36:48 [engine.py:458] Traceback (most recent call last):
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
ERROR 07-24 17:36:48 [engine.py:458] engine = MQLLMEngine.from_vllm_config(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
ERROR 07-24 17:36:48 [engine.py:458] return cls(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in __init__
ERROR 07-24 17:36:48 [engine.py:458] self.engine = LLMEngine(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 268, in __init__
ERROR 07-24 17:36:48 [engine.py:458] self._initialize_kv_caches()
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches
ERROR 07-24 17:36:48 [engine.py:458] self.model_executor.determine_num_available_blocks())
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 104, in determine_num_available_blocks
ERROR 07-24 17:36:48 [engine.py:458] results = self.collective_rpc("determine_num_available_blocks")
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 332, in collective_rpc
ERROR 07-24 17:36:48 [engine.py:458] return self._run_workers(method, *args, **(kwargs or {}))
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 186, in _run_workers
ERROR 07-24 17:36:48 [engine.py:458] driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/utils/__init__.py", line 2736, in run_method
ERROR 07-24 17:36:48 [engine.py:458] return func(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-24 17:36:48 [engine.py:458] return func(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/worker.py", line 256, in determine_num_available_blocks
ERROR 07-24 17:36:48 [engine.py:458] self.model_runner.profile_run()
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-24 17:36:48 [engine.py:458] return func(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1300, in profile_run
ERROR 07-24 17:36:48 [engine.py:458] self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1426, in _dummy_run
ERROR 07-24 17:36:48 [engine.py:458] self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 07-24 17:36:48 [engine.py:458] return func(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1844, in execute_model
ERROR 07-24 17:36:48 [engine.py:458] hidden_or_intermediate_states = model_executable(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 07-24 17:36:48 [engine.py:458] return self._call_impl(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 07-24 17:36:48 [engine.py:458] return forward_call(*args, **kwargs)
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1555, in forward
ERROR 07-24 17:36:48 [engine.py:458] inputs_embeds = self.get_input_embeddings_v0(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1500, in get_input_embeddings_v0
ERROR 07-24 17:36:48 [engine.py:458] inputs_embeds = merge_multimodal_embeddings(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 511, in merge_multimodal_embeddings
ERROR 07-24 17:36:48 [engine.py:458] return _merge_multimodal_embeddings(
ERROR 07-24 17:36:48 [engine.py:458] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-24 17:36:48 [engine.py:458] File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 427, in _merge_multimodal_embeddings
ERROR 07-24 17:36:48 [engine.py:458] raise ValueError(
ERROR 07-24 17:36:48 [engine.py:458] ValueError: Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders
INFO 07-24 17:36:48 [multiproc_worker_utils.py:125] Killing local vLLM worker processes
Process SpawnProcess-1:
Traceback (most recent call last):
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 460, in run_mp_engine
raise e from None
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 446, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 133, in from_vllm_config
return cls(
^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 87, in init
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 268, in init
self._initialize_kv_caches()
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 413, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 104, in determine_num_available_blocks
results = self.collective_rpc("determine_num_available_blocks")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 332, in collective_rpc
return self._run_workers(method, *args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 186, in _run_workers
driver_worker_output = run_method(self.driver_worker, sent_method,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/utils/init.py", line 2736, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/worker.py", line 256, in determine_num_available_blocks
self.model_runner.profile_run()
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1300, in profile_run
self._dummy_run(max_num_batched_tokens, max_num_seqs)
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1426, in _dummy_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1844, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1555, in forward
inputs_embeds = self.get_input_embeddings_v0(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/glm4_1v.py", line 1500, in get_input_embeddings_v0
inputs_embeds = merge_multimodal_embeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 511, in merge_multimodal_embeddings
return _merge_multimodal_embeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/model_executor/models/utils.py", line 427, in _merge_multimodal_embeddings
raise ValueError(
ValueError: Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders
[rank0]:[W724 17:36:49.096568445 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "/mnt/vdb/anaconda3/envs/glm4.1v/bin/vllm", line 8, in
sys.exit(main())
^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 65, in main
args.dispatch_function(args)
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 55, in cmd
uvloop.run(run_server(args))
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/uvloop/init.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 291, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/mnt/vdb/anaconda3/envs/glm4.1v/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
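For context, the crash comes from the consistency check in vllm/model_executor/models/utils.py (_merge_multimodal_embeddings): during the dummy profiling run, the number of image-embedding vectors produced for the dummy request (100) does not match the number of image placeholder tokens in the dummy input ids (30000). A simplified sketch of the invariant being enforced (not vLLM's exact code; names are illustrative):

import torch

def merge_multimodal_embeddings_sketch(
    input_ids: torch.Tensor,      # (seq_len,) token ids
    inputs_embeds: torch.Tensor,  # (seq_len, hidden) text embeddings
    mm_embeds: torch.Tensor,      # (num_mm_tokens, hidden) image embeddings
    placeholder_token_id: int,
) -> torch.Tensor:
    is_multimodal = input_ids == placeholder_token_id
    num_placeholders = int(is_multimodal.sum())
    # This is the check that fails above: 100 tokens vs 30000 placeholders.
    if mm_embeds.shape[0] != num_placeholders:
        raise ValueError(
            f"Attempted to assign {mm_embeds.shape[0]} multimodal tokens "
            f"to {num_placeholders} placeholders")
    merged = inputs_embeds.clone()
    merged[is_multimodal] = mm_embeds.to(dtype=inputs_embeds.dtype)
    return merged

So the V0 code path used on these GPUs (get_input_embeddings_v0 in glm4_1v.py) appears to build a dummy multimodal batch whose placeholder count does not match the dummy image embeddings. Whether shrinking --max-model-len, as the earlier warning suggests, avoids this particular mismatch is unverified.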
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.