[Usage]: ValueError: Unexpected weight for Qwen2-VL GPTQ 4-bit custom model. #9832

@bhavyajoshi-mahindra

Description

Your current environment

The output of `python collect_env.py`

WARNING 10-30 12:11:37 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
PyTorch version: 2.4.0+cpu
Is debug build: False
CUDA used to build PyTorch: Could not collect
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Home Single Language
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: False
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2060
Nvidia driver version: 565.90
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=2900
DeviceID=CPU0
Family=107
L2CacheSize=4096
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=2900
Name=AMD Ryzen 7 4800H with Radeon Graphics
ProcessorType=3
Revision=24577

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-ml-py==12.560.30
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.1
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect

I tried to run inference on my custom Qwen2-VL GPTQ 4-bit model using the code below:

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen2-VL"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/content/drive/MyDrive/LLM/test/Vin_2023-12-22_14-47-37.jpg",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text":
                                    '''
                                    Please extract the Vehicle Sr No, Engine No, and Model from this image.
                                    Response only json format nothing else.
                                    Analyze the font and double check for similar letters such as "V":"U", "8":"S":"0", "R":"P".
                                    '''
             },
        ],
    },
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

I got this error:

WARNING 10-30 12:06:32 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
You are using a model of type qwen2_vl to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
ERROR 10-30 12:06:37 registry.py:264] Error in inspecting model architecture 'Qwen2VLForConditionalGeneration'
ERROR 10-30 12:06:37 registry.py:264] Traceback (most recent call last):
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 426, in _run_in_subprocess
ERROR 10-30 12:06:37 registry.py:264]     returned.check_returncode()
ERROR 10-30 12:06:37 registry.py:264]   File "C:\Users\bhavy\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 456, in check_returncode
ERROR 10-30 12:06:37 registry.py:264]     raise CalledProcessError(self.returncode, self.args, self.stdout,
ERROR 10-30 12:06:37 registry.py:264] subprocess.CalledProcessError: Command '['F:\\Mahindra\\LLM\\myenv\\Scripts\\python.exe', '-m', 'vllm.model_executor.models.registry']' returned non-zero exit status 1.
ERROR 10-30 12:06:37 registry.py:264] 
ERROR 10-30 12:06:37 registry.py:264] The above exception was the direct cause of the following exception:
ERROR 10-30 12:06:37 registry.py:264]
ERROR 10-30 12:06:37 registry.py:264] Traceback (most recent call last):
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 262, in _try_inspect_model_cls        
ERROR 10-30 12:06:37 registry.py:264]     return model.inspect_model_cls()
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 224, in inspect_model_cls
ERROR 10-30 12:06:37 registry.py:264]     return _run_in_subprocess(
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 429, in _run_in_subprocess
ERROR 10-30 12:06:37 registry.py:264]     raise RuntimeError(f"Error raised in subprocess:\n"
ERROR 10-30 12:06:37 registry.py:264] RuntimeError: Error raised in subprocess:
ERROR 10-30 12:06:37 registry.py:264] C:\Users\bhavy\AppData\Local\Programs\Python\Python310\lib\runpy.py:126: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
ERROR 10-30 12:06:37 registry.py:264]   warn(RuntimeWarning(msg))
ERROR 10-30 12:06:37 registry.py:264] Traceback (most recent call last):
ERROR 10-30 12:06:37 registry.py:264]   File "C:\Users\bhavy\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main   
ERROR 10-30 12:06:37 registry.py:264]     return _run_code(code, main_globals, None,
ERROR 10-30 12:06:37 registry.py:264]   File "C:\Users\bhavy\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
ERROR 10-30 12:06:37 registry.py:264]     exec(code, run_globals)
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 450, in <module>
ERROR 10-30 12:06:37 registry.py:264]     _run()
ERROR 10-30 12:06:37 registry.py:264]   File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 445, in _run
ERROR 10-30 12:06:37 registry.py:264]     with open(output_file, "wb") as f:
ERROR 10-30 12:06:37 registry.py:264] PermissionError: [Errno 13] Permission denied: 'C:\\Users\\bhavy\\AppData\\Local\\Temp\\tmpjxi5mk75'
ERROR 10-30 12:06:37 registry.py:264]
Traceback (most recent call last):
  File "F:\Mahindra\LLM\vllm\qwen2-vl-vllm-infer.py", line 7, in <module>
    llm = LLM(
  File "F:\Mahindra\LLM\vllm\vllm\entrypoints\llm.py", line 177, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "F:\Mahindra\LLM\vllm\vllm\engine\llm_engine.py", line 571, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "F:\Mahindra\LLM\vllm\vllm\engine\arg_utils.py", line 900, in create_engine_config
    model_config = self.create_model_config()
  File "F:\Mahindra\LLM\vllm\vllm\engine\arg_utils.py", line 837, in create_model_config
    return ModelConfig(
  File "F:\Mahindra\LLM\vllm\vllm\config.py", line 194, in __init__
    self.multimodal_config = self._init_multimodal_config(
  File "F:\Mahindra\LLM\vllm\vllm\config.py", line 213, in _init_multimodal_config
    if ModelRegistry.is_multimodal_model(architectures):
  File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 384, in is_multimodal_model
    return self.inspect_model_cls(architectures).supports_multimodal
  File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 353, in inspect_model_cls
    return self._raise_for_unsupported(architectures)
  File "F:\Mahindra\LLM\vllm\vllm\model_executor\models\registry.py", line 314, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['Qwen2VLForConditionalGeneration'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'MistralModel', 'Qwen2ForRewardModel', 'Gemma2Model', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'Phi3VForCausalLM', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel']

Note:

  1. "Qwen2VLForConditionalGeneration" is in the list of supported architectures, yet I still get the error (see the diagnostic sketches after this list).
  2. collect_env.py reports "Is CUDA available: False", but nvcc --version shows:
    "nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
    Cuda compilation tools, release 12.1, V12.1.66
    Build cuda_12.1.r12.1/compiler.32415258_0"
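
Regarding note 1: the registry does list Qwen2VLForConditionalGeneration, but the ERROR lines above show that the subprocess vLLM spawns to inspect the model class exits with a PermissionError on a temp file (tmpjxi5mk75), and that failure is what surfaces as the "not supported" ValueError. The temp-file path suggests a named temporary file that is still held open when the subprocess tries to write to it, which Windows does not allow. A minimal sketch of that Windows behaviour, purely illustrative and not vLLM code:

import sys
import tempfile

# On Windows, a NamedTemporaryFile that is still open cannot be opened
# a second time by its path; attempting to do so raises the same
# "PermissionError: [Errno 13]" seen in the registry traceback.
if sys.platform == "win32":
    with tempfile.NamedTemporaryFile() as tmp:
        try:
            with open(tmp.name, "wb") as f:
                f.write(b"ok")
        except PermissionError as exc:
            print("reopen by path failed:", exc)
else:
    print("On non-Windows platforms the second open normally succeeds.")

If that is the failure mode, running the same script under Linux or WSL would quickly confirm whether the PermissionError, rather than missing Qwen2-VL support, is what triggers the ValueError.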
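
Regarding note 2: the environment dump shows "PyTorch version: 2.4.0+cpu", i.e. a CPU-only wheel, so torch cannot use the GPU regardless of what nvcc --version reports (nvcc only describes the installed compiler toolkit). A quick check, assuming nothing beyond a standard torch install:

import torch

# A "+cpu" wheel reports torch.version.cuda as None even when the CUDA
# toolkit and driver are installed system-wide.
print("torch version:  ", torch.__version__)         # e.g. 2.4.0+cpu
print("built with CUDA:", torch.version.cuda)         # None on CPU-only wheels
print("CUDA available: ", torch.cuda.is_available())

If torch.version.cuda prints None, installing a CUDA-enabled torch wheel (and a vLLM build compiled against it) would be the usual first step before retrying the script above.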

Can anyone help me with this?

How would you like to use vllm

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
