Description
Your current environment
The output of `python collect_env.py`:
INFO 07-08 15:11:24 [__init__.py:244] Automatically detected platform cuda.
Collecting environment information...
==============================
System Info
==============================
OS : Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version : (Debian 12.2.0-14+deb12u1) 12.2.0
Clang version : Could not collect
CMake version : Could not collect
Libc version : glibc-2.36
==============================
PyTorch Info
==============================
PyTorch version : 2.7.0+cu126
Is debug build : False
CUDA used to build PyTorch : 12.6
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.10.18 | packaged by conda-forge | (main, Jun 4 2025, 14:45:41) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-5.4.0-166-generic-x86_64-with-glibc2.36
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : GPU 0: NVIDIA A10
Nvidia driver version : 570.124.06
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 28
On-line CPU(s) list: 0-27
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7K83 64-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 28
Socket(s): 1
Stepping: 0
BogoMIPS: 5090.43
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 896 KiB (28 instances)
L1i cache: 896 KiB (28 instances)
L2 cache: 7 MiB (14 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-27
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0
[pip3] torchaudio==2.7.0
[pip3] torchvision==0.22.0
[pip3] transformers==4.53.0
[pip3] triton==3.3.0
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi
[conda] nvidia-cufile-cu12 1.11.1.6 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi
[conda] pyzmq 27.0.0 pypi_0 pypi
[conda] torch 2.7.0 pypi_0 pypi
[conda] torchaudio 2.7.0 pypi_0 pypi
[conda] torchvision 0.22.0 pypi_0 pypi
[conda] transformers 4.53.0 pypi_0 pypi
[conda] triton 3.3.0 pypi_0 pypi
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-27 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
==============================
Environment Variables
==============================
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
I am deploying my fine-tuned model and ran into a FlashAttention problem. It works fine when I run `vllm serve /root/.cache/huggingface/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int8 --quantization gptq`, but it fails with the command below:
vllm serve /usr/rag
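For reference, the same failure can presumably be reproduced without the API server through the offline entry point (a minimal sketch, assuming /usr/rag is a standard Hugging Face-format checkpoint directory):

```python
# Minimal offline reproduction sketch: building the engine directly should
# hit the same TypeError during model loading as `vllm serve /usr/rag`.
from vllm import LLM, SamplingParams

llm = LLM(model="/usr/rag")  # engine init fails here in my environment
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```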
Here is the error message:
INFO 07-08 14:54:38 [__init__.py:244] Automatically detected platform cuda.
INFO 07-08 14:54:43 [api_server.py:1287] vLLM API server version 0.9.1
INFO 07-08 14:54:44 [cli_args.py:309] non-default args: {'model': '/usr/rag'}
INFO 07-08 14:54:51 [config.py:823] This model supports multiple tasks: {'score', 'embed', 'reward', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 07-08 14:54:51 [config.py:3268] Downcasting torch.float32 to torch.bfloat16.
INFO 07-08 14:54:51 [config.py:2195] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 07-08 14:54:53 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 07-08 14:54:55 [__init__.py:244] Automatically detected platform cuda.
INFO 07-08 14:54:58 [core.py:455] Waiting for init message from front-end.
INFO 07-08 14:54:58 [core.py:70] Initializing a V1 LLM engine (v0.9.1) with config: model='/usr/rag', speculative_config=None, tokenizer='/usr/rag', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1010000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/usr/rag, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 07-08 14:54:58 [utils.py:2737] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f96b7597a00>
INFO 07-08 14:54:58 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 07-08 14:54:58 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 07-08 14:54:58 [gpu_model_runner.py:1595] Starting to load model /usr/rag...
INFO 07-08 14:54:59 [gpu_model_runner.py:1600] Loading model from scratch...
INFO 07-08 14:54:59 [cuda.py:252] Using Flash Attention backend on V1 engine.
ERROR 07-08 14:54:59 [core.py:515] EngineCore failed to start.
ERROR 07-08 14:54:59 [core.py:515] Traceback (most recent call last):
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 07-08 14:54:59 [core.py:515] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 07-08 14:54:59 [core.py:515] super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 76, in __init__
ERROR 07-08 14:54:59 [core.py:515] self.model_executor = executor_class(vllm_config)
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-08 14:54:59 [core.py:515] self._init_executor()
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
ERROR 07-08 14:54:59 [core.py:515] self.collective_rpc("load_model")
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 07-08 14:54:59 [core.py:515] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/utils.py", line 2671, in run_method
ERROR 07-08 14:54:59 [core.py:515] return func(*args, **kwargs)
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 180, in load_model
ERROR 07-08 14:54:59 [core.py:515] self.model_runner.load_model()
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1601, in load_model
ERROR 07-08 14:54:59 [core.py:515] self.model = model_loader.load_model(
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
ERROR 07-08 14:54:59 [core.py:515] model = initialize_model(vllm_config=vllm_config,
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
ERROR 07-08 14:54:59 [core.py:515] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 447, in __init__
ERROR 07-08 14:54:59 [core.py:515] self.model = Qwen2Model(vllm_config=vllm_config,
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 152, in __init__
ERROR 07-08 14:54:59 [core.py:515] old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 316, in __init__
ERROR 07-08 14:54:59 [core.py:515] self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 626, in make_layers
ERROR 07-08 14:54:59 [core.py:515] [PPMissingLayer() for _ in range(start_layer)] + [
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 627, in <listcomp>
ERROR 07-08 14:54:59 [core.py:515] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 318, in <lambda>
ERROR 07-08 14:54:59 [core.py:515] lambda prefix: decoder_layer_type(config=config,
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 215, in __init__
ERROR 07-08 14:54:59 [core.py:515] self.self_attn = Qwen2Attention(
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 161, in __init__
ERROR 07-08 14:54:59 [core.py:515] self.attn = Attention(
ERROR 07-08 14:54:59 [core.py:515] File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/attention/layer.py", line 137, in __init__
ERROR 07-08 14:54:59 [core.py:515] self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
ERROR 07-08 14:54:59 [core.py:515] TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'layer_idx'
Process EngineCore_0:
Traceback (most recent call last):
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 519, in run_engine_core
raise e
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 390, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 76, in __init__
self.model_executor = executor_class(vllm_config)
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 53, in __init__
self._init_executor()
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 48, in _init_executor
self.collective_rpc("load_model")
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/utils.py", line 2671, in run_method
return func(*args, **kwargs)
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 180, in load_model
self.model_runner.load_model()
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1601, in load_model
self.model = model_loader.load_model(
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 38, in load_model
model = initialize_model(vllm_config=vllm_config,
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 62, in initialize_model
return model_class(vllm_config=vllm_config, prefix=prefix)
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 447, in __init__
self.model = Qwen2Model(vllm_config=vllm_config,
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 152, in __init__
old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 316, in __init__
self.start_layer, self.end_layer, self.layers = make_layers(
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 626, in make_layers
[PPMissingLayer() for _ in range(start_layer)] + [
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/utils.py", line 627, in <listcomp>
maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 318, in <lambda>
lambda prefix: decoder_layer_type(config=config,
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 215, in __init__
self.self_attn = Qwen2Attention(
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 161, in __init__
self.attn = Attention(
File "/opt/conda/envs/teleagent-vllm/lib/python3.10/site-packages/vllm/attention/layer.py", line 137, in __init__
self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'layer_idx'
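The TypeError suggests that some per-layer keyword argument from the checkpoint's configuration is being forwarded to the V1 FlashAttention backend, which does not accept it. Since the engine log above reports max_seq_len=1010000, the fine-tuned config may carry long-context attention settings (Qwen2.5-1M-style). A quick way to see which attention-related keys the config contains (a sketch; the key filter below is an illustrative guess, not a confirmed cause):

```python
import json
from pathlib import Path

# List config entries of the fine-tuned checkpoint that look attention- or
# long-context-related, to see what extra kwargs vLLM might be forwarding.
cfg = json.loads(Path("/usr/rag/config.json").read_text())
suspects = [k for k in cfg if any(s in k for s in ("attention", "rope", "chunk"))]
for k in suspects:
    print(k, "=", cfg[k])
print("max_position_embeddings =", cfg.get("max_position_embeddings"))
```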
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.