Your current environment
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/utils/_path_manager.py:66: UserWarning: Permission mismatch: The owner of /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libop_plugin_atb.so does not match.
warnings.warn(f"Permission mismatch: The owner of {path} does not match.")
Collecting environment information...
PyTorch version: 2.7.1+cpu
Is debug build: False
OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.1.0
Libc version: glibc-2.35
Python version: 3.11.13 (main, Jul 26 2025, 07:27:32) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-47-generic-aarch64-with-glibc2.35
CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: HiSilicon
Model name: Kunpeng-920
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 48
Socket(s): -
Cluster(s): 4
Stepping: 0x1
BogoMIPS: 200.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache: 12 MiB (192 instances)
L1i cache: 12 MiB (192 instances)
L2 cache: 96 MiB (192 instances)
L3 cache: 192 MiB (8 instances)
NUMA node(s): 8
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
NUMA node4 CPU(s): 96-119
NUMA node5 CPU(s): 120-143
NUMA node6 CPU(s): 144-167
NUMA node7 CPU(s): 168-191
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.0.2
[pip3] sentence-transformers==4.1.0
[pip3] torch==2.7.1+cpu
[pip3] torch_npu==2.7.1.dev20250724
[pip3] torchvision==0.22.1
[pip3] transformers==4.54.0.dev0
[conda] Could not collect
vLLM Version: 0.10.1.1
vLLM Ascend Version: 0.10.1rc1
ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2.2 Version: 24.1.rc2.2 |
+---------------------------+---------------+----------------------------------------------------+
| NPU Name | Health | Power(W) Temp(C) Hugepages-Usage(page)|
| Chip | Bus-Id | AICore(%) Memory-Usage(MB) HBM-Usage(MB) |
+===========================+===============+====================================================+
| 0 910B3 | OK | 95.0 40 0 / 0 |
| 0 | 0000:C1:00.0 | 0 0 / 0 3399 / 65536 |
+===========================+===============+====================================================+
| 1 910B3 | OK | 93.0 38 0 / 0 |
| 0 | 0000:C2:00.0 | 0 0 / 0 3390 / 65536 |
+===========================+===============+====================================================+
| 2 910B3 | OK | 91.2 39 0 / 0 |
| 0 | 0000:81:00.0 | 0 0 / 0 3388 / 65536 |
+===========================+===============+====================================================+
| 3 910B3 | OK | 96.5 41 0 / 0 |
| 0 | 0000:82:00.0 | 0 0 / 0 3389 / 65536 |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU Chip | Process id | Process name | Process memory(MB) |
+===========================+===============+====================================================+
| No running processes found in NPU 0 |
+===========================+===============+====================================================+
| No running processes found in NPU 1 |
+===========================+===============+====================================================+
| No running processes found in NPU 2 |
+===========================+===============+====================================================+
| No running processes found in NPU 3 |
+===========================+===============+====================================================+
CANN:
package_name=Ascend-cann-toolkit
version=8.2.RC1
innerversion=V100R001C22SPC001B231
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.2.RC1/aarch64-linux
🐛 Describe the bug
Server launch command
source /usr/local/Ascend/ascend-toolkit/set_env.sh && \
source /usr/local/Ascend/nnal/atb/set_env.sh && \
export LD_LIBRARY_PATH=/usr/local/lib/:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling:/usr/local/Ascend/driver/lib64/common/:/usr/local/Ascend/driver/lib64/driver/: && \
cd /vllm-workspace/vllm && \
python3 -m vllm.entrypoints.openai.api_server \
    --model /model \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8669 \
    --served-model-name Qwen2.5-72B \
    --uvicorn-log-level info \
    --tool-call-parser hermes \
    --enable-auto-tool-choice \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --max_num_batched_tokens 32768 \
    --additional-config '{"torchair_graph_config":{"enabled":true},"ascend_scheduler_config":{"enabled":true}}'
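(For context, client traffic goes through the server's /v1/chat/completions route; a request of roughly this shape, with the port and served model name taken from the command above, exercises the path that later fails. The payload is illustrative only, not the exact request our client sent.)

curl -s http://0.0.0.0:8669/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen2.5-72B", "messages": [{"role": "user", "content": "hello"}]}'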
Error output
[RAW SERVER] (VllmWorker TP0 pid=787)
[RAW SERVER] Loading safetensors checkpoint shards: 100% Completed | 37/37 [01:25<00:00, 2.32s/it]
[RAW SERVER] (VllmWorker TP0 pid=787)
[RAW SERVER] (VllmWorker TP0 pid=787) INFO 09-05 06:44:16 [default_loader.py:262] Loading weights took 85.90 seconds
[RAW SERVER] (VllmWorker TP1 pid=788) INFO 09-05 06:44:16 [default_loader.py:262] Loading weights took 88.37 seconds
[RAW SERVER] (VllmWorker TP3 pid=790) INFO 09-05 06:44:16 [default_loader.py:262] Loading weights took 85.12 seconds
[RAW SERVER] (VllmWorker TP3 pid=790) INFO 09-05 06:44:17 [model_runner_v1.py:2312] Loading model weights took 33.9994 GB
[RAW SERVER] (VllmWorker TP0 pid=787) INFO 09-05 06:44:17 [model_runner_v1.py:2312] Loading model weights took 33.9994 GB
[RAW SERVER] (VllmWorker TP2 pid=789) INFO 09-05 06:44:17 [default_loader.py:262] Loading weights took 88.23 seconds
[RAW SERVER] (VllmWorker TP1 pid=788) INFO 09-05 06:44:17 [model_runner_v1.py:2312] Loading model weights took 33.9994 GB
[RAW SERVER] (VllmWorker TP2 pid=789) INFO 09-05 06:44:18 [model_runner_v1.py:2312] Loading model weights took 33.9994 GB
2025-09-05 06:44:19 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
[RAW SERVER] (VllmWorker TP3 pid=790) INFO 09-05 06:44:32 [worker_v1.py:190] Available memory: 17716460605, total memory: 65464696832
[RAW SERVER] (VllmWorker TP3 pid=790) INFO 09-05 06:44:32 [torchair_worker.py:54] Use new kv_cache_bytes: 17649351741
[RAW SERVER] (VllmWorker TP2 pid=789) INFO 09-05 06:44:32 [worker_v1.py:190] Available memory: 17717066813, total memory: 65464696832
[RAW SERVER] (VllmWorker TP2 pid=789) INFO 09-05 06:44:32 [torchair_worker.py:54] Use new kv_cache_bytes: 17649957949
[RAW SERVER] (VllmWorker TP0 pid=787) INFO 09-05 06:44:32 [worker_v1.py:190] Available memory: 17499868221, total memory: 65464696832
[RAW SERVER] (VllmWorker TP0 pid=787) INFO 09-05 06:44:32 [torchair_worker.py:54] Use new kv_cache_bytes: 17432759357
[RAW SERVER] (VllmWorker TP1 pid=788) INFO 09-05 06:44:32 [worker_v1.py:190] Available memory: 17715563581, total memory: 65464696832
[RAW SERVER] (VllmWorker TP1 pid=788) INFO 09-05 06:44:32 [torchair_worker.py:54] Use new kv_cache_bytes: 17648454717
[RAW SERVER] (EngineCore_0 pid=651) INFO 09-05 06:44:32 [kv_cache_utils.py:849] GPU KV cache size: 212,736 tokens
[RAW SERVER] (EngineCore_0 pid=651) INFO 09-05 06:44:32 [kv_cache_utils.py:853] Maximum concurrency for 32,768 tokens per request: 6.49x
[RAW SERVER] (EngineCore_0 pid=651) INFO 09-05 06:44:32 [kv_cache_utils.py:849] GPU KV cache size: 215,424 tokens
[RAW SERVER] (EngineCore_0 pid=651) INFO 09-05 06:44:32 [kv_cache_utils.py:853] Maximum concurrency for 32,768 tokens per request: 6.57x
[RAW SERVER] (EngineCore_0 pid=651) INFO 09-05 06:44:32 [kv_cache_utils.py:849] GPU KV cache size: 215,424 tokens
[RAW SERVER] (EngineCore_0 pid=651) INFO 09-05 06:44:32 [kv_cache_utils.py:853] Maximum concurrency for 32,768 tokens per request: 6.57x
[RAW SERVER] (EngineCore_0 pid=651) INFO 09-05 06:44:32 [kv_cache_utils.py:849] GPU KV cache size: 215,424 tokens
[RAW SERVER] (EngineCore_0 pid=651) INFO 09-05 06:44:32 [kv_cache_utils.py:853] Maximum concurrency for 32,768 tokens per request: 6.57x
[RAW SERVER] (VllmWorker TP1 pid=788) INFO 09-05 06:44:32 [torchair_model_runner.py:221] Capturing torchair graph, this usually takes 1.0~3.0 mins.
[RAW SERVER] (VllmWorker TP2 pid=789) INFO 09-05 06:44:32 [torchair_model_runner.py:221] Capturing torchair graph, this usually takes 1.0~3.0 mins.
[RAW SERVER] (VllmWorker TP3 pid=790) INFO 09-05 06:44:32 [torchair_model_runner.py:221] Capturing torchair graph, this usually takes 1.0~3.0 mins.
[RAW SERVER] (VllmWorker TP0 pid=787) INFO 09-05 06:44:32 [torchair_model_runner.py:221] Capturing torchair graph, this usually takes 1.0~3.0 mins.
2025-09-05 06:44:34 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:44:49 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:45:04 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:45:19 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:45:34 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
[RAW SERVER] (VllmWorker TP3 pid=790) /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:1012: UserWarning: When enable frozen_parameter, Parameters and input tensors with immutable data_ptr marked by torch._dynamo.mark_static_address()
will be considered frozen. Please make sure that the Parameters data address remain the same throughout the program runtime.
[RAW SERVER] (VllmWorker TP3 pid=790) warnings.warn('When enable frozen_parameter, Parameters and input tensors with immutable data_ptr '
[RAW SERVER] (VllmWorker TP2 pid=789) /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:1012: UserWarning: When enable frozen_parameter, Parameters and input tensors with immutable data_ptr marked by torch._dynamo.mark_static_address()
will be considered frozen. Please make sure that the Parameters data address remain the same throughout the program runtime.
[RAW SERVER] (VllmWorker TP2 pid=789) warnings.warn('When enable frozen_parameter, Parameters and input tensors with immutable data_ptr '
[RAW SERVER] (VllmWorker TP0 pid=787) /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:1012: UserWarning: When enable frozen_parameter, Parameters and input tensors with immutable data_ptr marked by torch._dynamo.mark_static_address()
will be considered frozen. Please make sure that the Parameters data address remain the same throughout the program runtime.
[RAW SERVER] (VllmWorker TP0 pid=787) warnings.warn('When enable frozen_parameter, Parameters and input tensors with immutable data_ptr '
[RAW SERVER] (VllmWorker TP1 pid=788) /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:1012: UserWarning: When enable frozen_parameter, Parameters and input tensors with immutable data_ptr marked by torch._dynamo.mark_static_address()
will be considered frozen. Please make sure that the Parameters data address remain the same throughout the program runtime.
[RAW SERVER] (VllmWorker TP1 pid=788) warnings.warn('When enable frozen_parameter, Parameters and input tensors with immutable data_ptr '
2025-09-05 06:45:49 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:46:04 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:46:19 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:46:34 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:46:49 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
[RAW SERVER] (VllmWorker TP0 pid=787) INFO 09-05 06:46:53 [torchair_model_runner.py:187] Batchsize 256 is compiled successfully: 1/2.
[RAW SERVER] (VllmWorker TP3 pid=790) INFO 09-05 06:46:53 [torchair_model_runner.py:187] Batchsize 256 is compiled successfully: 1/2.
[RAW SERVER] (VllmWorker TP2 pid=789) INFO 09-05 06:46:53 [torchair_model_runner.py:187] Batchsize 256 is compiled successfully: 1/2.
[RAW SERVER] (VllmWorker TP1 pid=788) INFO 09-05 06:46:53 [torchair_model_runner.py:187] Batchsize 256 is compiled successfully: 1/2.
2025-09-05 06:47:04 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:47:19 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:47:34 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
[RAW SERVER] (VllmWorker TP3 pid=790) /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:1012: UserWarning: When enable frozen_parameter, Parameters and input tensors with immutable data_ptr marked by torch._dynamo.mark_static_address()
will be considered frozen. Please make sure that the Parameters data address remain the same throughout the program runtime.
[RAW SERVER] (VllmWorker TP3 pid=790) warnings.warn('When enable frozen_parameter, Parameters and input tensors with immutable data_ptr '
[RAW SERVER] (VllmWorker TP0 pid=787) /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:1012: UserWarning: When enable frozen_parameter, Parameters and input tensors with immutable data_ptr marked by torch._dynamo.mark_static_address()
will be considered frozen. Please make sure that the Parameters data address remain the same throughout the program runtime.
[RAW SERVER] (VllmWorker TP0 pid=787) warnings.warn('When enable frozen_parameter, Parameters and input tensors with immutable data_ptr '
[RAW SERVER] (VllmWorker TP2 pid=789) /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:1012: UserWarning: When enable frozen_parameter, Parameters and input tensors with immutable data_ptr marked by torch._dynamo.mark_static_address()
will be considered frozen. Please make sure that the Parameters data address remain the same throughout the program runtime.
[RAW SERVER] (VllmWorker TP2 pid=789) warnings.warn('When enable frozen_parameter, Parameters and input tensors with immutable data_ptr '
[RAW SERVER] (VllmWorker TP1 pid=788) /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:1012: UserWarning: When enable frozen_parameter, Parameters and input tensors with immutable data_ptr marked by torch._dynamo.mark_static_address()
will be considered frozen. Please make sure that the Parameters data address remain the same throughout the program runtime.
[RAW SERVER] (VllmWorker TP1 pid=788) warnings.warn('When enable frozen_parameter, Parameters and input tensors with immutable data_ptr '
2025-09-05 06:47:49 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:48:04 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:48:19 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
2025-09-05 06:48:34 INFO [llmserver] : OPENAI service not ready yet, retrying in 15 seconds...
[RAW SERVER] (VllmWorker TP3 pid=790) INFO 09-05 06:48:41 [torchair_model_runner.py:187] Batchsize 1 is compiled successfully: 2/2.
[RAW SERVER] (VllmWorker TP2 pid=789) INFO 09-05 06:48:41 [torchair_model_runner.py:187] Batchsize 1 is compiled successfully: 2/2.
[RAW SERVER] (VllmWorker TP3 pid=790) INFO 09-05 06:48:41 [model_runner_v1.py:2590] Graph capturing finished in 249 secs, took 0.16 GiB
[RAW SERVER] (VllmWorker TP2 pid=789) INFO 09-05 06:48:41 [model_runner_v1.py:2590] Graph capturing finished in 249 secs, took 0.15 GiB
[RAW SERVER] (VllmWorker TP0 pid=787) INFO 09-05 06:48:41 [torchair_model_runner.py:187] Batchsize 1 is compiled successfully: 2/2.
[RAW SERVER] (VllmWorker TP0 pid=787) INFO 09-05 06:48:41 [model_runner_v1.py:2590] Graph capturing finished in 249 secs, took 0.16 GiB
[RAW SERVER] (VllmWorker TP1 pid=788) INFO 09-05 06:48:41 [torchair_model_runner.py:187] Batchsize 1 is compiled successfully: 2/2.
[RAW SERVER] (VllmWorker TP1 pid=788) INFO 09-05 06:48:41 [model_runner_v1.py:2590] Graph capturing finished in 249 secs, took 0.15 GiB
[RAW SERVER] (EngineCore_0 pid=651) INFO 09-05 06:48:41 [core.py:214] init engine (profile, create kv cache, warmup model) took 263.36 seconds
[RAW SERVER] (EngineCore_0 pid=651) WARNING 09-05 06:48:42 [core.py:109] Using configured V1 scheduler class vllm_ascend.core.scheduler.AscendScheduler. This scheduler interface is not public and compatibility may not be maintained.
[RAW SERVER] (EngineCore_0 pid=651) WARNING 09-05 06:48:42 [platform.py:164] compilation_config.level = CompilationLevel.NO_COMPILATION is set, Setting CUDAGraphMode to NONE
[RAW SERVER] (EngineCore_0 pid=651) INFO 09-05 06:48:42 [platform.py:171] Torchair compilation enabled on NPU. Setting CUDAGraphMode to NONE
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 1662
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [api_server.py:1611] Supported_tasks: ['generate']
[RAW SERVER] (APIServer pid=380) WARNING 09-05 06:48:42 [init.py:1625] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [serving_responses.py:120] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [serving_responses.py:149] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [serving_chat.py:94] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [serving_chat.py:134] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [serving_completion.py:77] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [api_server.py:1880] Starting vLLM API server 0 on http://0.0.0.0:8669
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:36] Available routes are:
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /docs, Methods: HEAD, GET
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /health, Methods: GET
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /load, Methods: GET
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /ping, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /ping, Methods: GET
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /tokenize, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /detokenize, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/models, Methods: GET
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /version, Methods: GET
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/responses, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/completions, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/embeddings, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /pooling, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /classify, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /score, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/score, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /rerank, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v1/rerank, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /v2/rerank, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /invocations, Methods: POST
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:42 [launcher.py:44] Route: /metrics, Methods: GET
[RAW SERVER] (APIServer pid=380) INFO: Started server process [380]
[RAW SERVER] (APIServer pid=380) INFO: Waiting for application startup.
[RAW SERVER] (APIServer pid=380) INFO: Application startup complete.
2025-09-05 06:48:49 INFO [llmserver] : OPENAI service is ready!
2025-09-05 06:48:49 INFO [llmserver] : Starting GPU hang check task
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8702 (Press CTRL+C to quit)
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58592 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 127.0.0.1:34742 - "GET /v1/models HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "GET /v1/models HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) WARNING 09-05 06:48:55 [protocol.py:81] The following fields were present in the request but ignored: {'system'}
INFO: 127.0.0.1:34746 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:48:55 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (VllmWorker TP2 pid=789) WARNING 09-05 06:48:55 [cudagraph_dispatcher.py:101] cudagraph dispatching keys are not initialized. No cudagraph will be used.
[RAW SERVER] (VllmWorker TP3 pid=790) WARNING 09-05 06:48:55 [cudagraph_dispatcher.py:101] cudagraph dispatching keys are not initialized. No cudagraph will be used.
[RAW SERVER] (VllmWorker TP1 pid=788) WARNING 09-05 06:48:55 [cudagraph_dispatcher.py:101] cudagraph dispatching keys are not initialized. No cudagraph will be used.
[RAW SERVER] (VllmWorker TP0 pid=787) WARNING 09-05 06:48:55 [cudagraph_dispatcher.py:101] cudagraph dispatching keys are not initialized. No cudagraph will be used.
INFO: 127.0.0.1:49430 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:49:03 [loggers.py:123] Engine 000: Avg prompt throughput: 6.2 tokens/s, Avg generation throughput: 11.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:49:13 [loggers.py:123] Engine 000: Avg prompt throughput: 5.7 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:49:23 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:49:33 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:49:43 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
INFO: 127.0.0.1:49438 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:37156 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:49:53 [loggers.py:123] Engine 000: Avg prompt throughput: 5.7 tokens/s, Avg generation throughput: 18.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:50:03 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:50:13 [loggers.py:123] Engine 000: Avg prompt throughput: 5.5 tokens/s, Avg generation throughput: 18.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:50:23 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:50:33 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:50:43 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
INFO: 127.0.0.1:38662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:33244 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:50:53 [loggers.py:123] Engine 000: Avg prompt throughput: 5.5 tokens/s, Avg generation throughput: 18.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:51:03 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:51:13 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:51:23 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
INFO: 127.0.0.1:40454 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:40462 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:40472 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:40482 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:51:33 [loggers.py:123] Engine 000: Avg prompt throughput: 35.1 tokens/s, Avg generation throughput: 14.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
[RAW SERVER] (APIServer pid=380) INFO 09-05 06:51:43 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
INFO: 127.0.0.1:40494 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 127.0.0.1:57594 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] (APIServer pid=380) INFO: 127.0.0.1:58602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[RAW SERVER] mki_log mkdir /home/bml/atb/
[RAW SERVER] mki_log mkdir /home/bml/atb/log
[RAW SERVER] [rank2]:[E905 06:51:44.673826892 compiler_depend.ts:429] SelfAttentionOperation setup failed!
[RAW SERVER] Exception raised from OperationSetup at build/third_party/op-plugin/op_plugin/CMakeFiles/op_plugin_atb.dir/compiler_depend.ts:151 (most recent call first):
[RAW SERVER] frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xffff76283ea4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
[RAW SERVER] frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe4 (0xffff76223e44 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
[RAW SERVER] frame #2: atb::OperationSetup(atb::VariantPack, atb::Operation*, atb::Context*) + 0x254 (0xffff53d8ac24 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libop_plugin_atb.so)
[RAW SERVER] frame #3: <unknown function> + 0x8b7bc (0xffff53d8b7bc in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libop_plugin_atb.so)
[RAW SERVER] frame #4: <unknown function> + 0x22887d4 (0xffff68c587d4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
[RAW SERVER] frame #5: <unknown function> + 0x8fb170 (0xffff672cb170 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
[RAW SERVER] frame #6: <unknown function> + 0x8fd504 (0xffff672cd504 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
[RAW SERVER] frame #7: <unknown function> + 0x8f9e2c (0xffff672c9e2c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
[RAW SERVER] frame #8: <unknown function> + 0xd31fc (0xffff760931fc in /lib/aarch64-linux-gnu/libstdc++.so.6)
[RAW SERVER] frame #9: <unknown function> + 0x7d5b8 (0xffff822fd5b8 in /lib/aarch64-linux-gnu/libc.so.6)
[RAW SERVER] frame #10: <unknown function> + 0xe5edc (0xffff82365edc in /lib/aarch64-linux-gnu/libc.so.6)
[RAW SERVER]
Checking the ATB logs
cat /home/bml/atb/log/*.log
[2025-09-05 06:51:44.116849] [error] [1546] [tensor_check.cpp:28] tensor dimNum 0 is invalid, should >0 && <= MAX_DIM(8)
[2025-09-05 06:51:44.117190] [error] [1546] [operation_base.cpp:182] SelfAttentionOperation_1 inTensor [1] CheckTensorShape failed. ErrorType: 8
[2025-09-05 06:51:44.117204] [error] [1546] [operation_base.cpp:625] SelfAttentionOperation_1 invalid param, setup check fail, error code: 8
[2025-09-05 06:51:44.114969] [error] [1471] [tensor_check.cpp:28] tensor dimNum 0 is invalid, should >0 && <= MAX_DIM(8)
[2025-09-05 06:51:44.115455] [error] [1471] [operation_base.cpp:182] SelfAttentionOperation_1 inTensor [1] CheckTensorShape failed. ErrorType: 8
[2025-09-05 06:51:44.115467] [error] [1471] [operation_base.cpp:625] SelfAttentionOperation_1 invalid param, setup check fail, error code: 8
[2025-09-05 06:51:44.113766] [error] [1185] [tensor_check.cpp:28] tensor dimNum 0 is invalid, should >0 && <= MAX_DIM(8)
[2025-09-05 06:51:44.114359] [error] [1185] [operation_base.cpp:182] SelfAttentionOperation_1 inTensor [1] CheckTensorShape failed. ErrorType: 8
[2025-09-05 06:51:44.114370] [error] [1185] [operation_base.cpp:625] SelfAttentionOperation_1 invalid param, setup check fail, error code: 8
[2025-09-05 06:51:44.114324] [error] [1173] [tensor_check.cpp:28] tensor dimNum 0 is invalid, should >0 && <= MAX_DIM(8)
[2025-09-05 06:51:44.114811] [error] [1173] [operation_base.cpp:182] SelfAttentionOperation_1 inTensor [1] CheckTensorShape failed. ErrorType: 8
[2025-09-05 06:51:44.114823] [error] [1173] [operation_base.cpp:625] SelfAttentionOperation_1 invalid param, setup check fail, error code: 8
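The same three-line failure (dimNum 0 on inTensor [1] of SelfAttentionOperation_1) repeats once per worker process (pids 1173, 1185, 1471, 1546). To pull just the error lines out of the per-process logs, assuming the default log directory used above:

grep -h "\[error\]" /home/bml/atb/log/*.log | sort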