[Bug]: vllm serve works incorrectly for (some) Vision LM models #10286

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-124-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3

Nvidia driver version: 565.57.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               80
On-line CPU(s) list:                  0-79
Vendor ID:                            GenuineIntel
Model name:                           Intel Xeon Processor (Icelake)
CPU family:                           6
Model:                                106
Thread(s) per core:                   2
Core(s) per socket:                   40
Socket(s):                            1
Stepping:                             0
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            2.5 MiB (80 instances)
L1i cache:                            2.5 MiB (80 instances)
L2 cache:                             160 MiB (40 instances)
L3 cache:                             16 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-79
Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.3
[pip3] pytorch-triton==3.1.0+cf34004b8a
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchaudio==2.5.0.dev20241105+cu121
[pip3] torchvision==0.19.0
[pip3] transformers==4.46.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    0-79    0               N/A
GPU1    NV18     X      NV18    NV18    0-79    0               N/A
GPU2    NV18    NV18     X      NV18    0-79    0               N/A
GPU3    NV18    NV18    NV18     X      0-79    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

LD_LIBRARY_PATH=/mnt/share/ai_studio/.venv/lib/python3.11/site-packages/cv2/../../lib64:/usr/local/cuda-12.1/lib64:
CUDA_MODULE_LOADING=LAZY
```

</details>


### Model Input Dumps

_No response_

### 🐛 Describe the bug

I am running the vision-language model `llava-hf/llava-1.5-13b-hf` via `vllm serve`, and it produces degenerate output. The [official script from the vLLM examples](https://github.com/vllm-project/vllm/blob/3945c82346dae3129213607663bfd17edd905fef/examples/openai_chat_completion_client_for_multimodal.py#L64), with `top_p` set to 0.1 for more deterministic output, returns only '\n' tokens:
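```python
from openai import OpenAI

# Assumption: the server started by `vllm serve` listens on the default
# address http://localhost:8000/v1 and does not check the API key.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

image_url = "https://wallpapers.com/images/featured/high-resolution-gfinds1akzwf6vcq.jpg"

# Send one user turn containing a short text part and an image URL part.
chat_completion_from_url = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "hey"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
    model="llava-hf/llava-1.5-13b-hf",
    max_tokens=32,
    top_p=0.1,
)

result = chat_completion_from_url.choices[0].message.content
print("Chat completion output from image url:", result)

# This outputs the '\n' token 32 times.
```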

I launch the vLLM server as described in [this official script](https://github.com/vllm-project/vllm/blob/3945c82346dae3129213607663bfd17edd905fef/examples/openai_chat_completion_client_for_multimodal.py#L7):
```
vllm serve llava-hf/llava-1.5-13b-hf --chat-template template_llava.jinja
```

Crucially, running the same model through vLLM's Python `LLM` class in a Jupyter notebook yields completely normal output, which matches the output obtained with Hugging Face Transformers in [the official LLaVA example](https://colab.research.google.com/drive/1qsl6cd2c8gGtEW1xV5io7S8NHh-Cp1TV?usp=sharing):

```
from vllm import LLM, SamplingParams
from PIL import Image
import requests

image_url = "https://wallpapers.com/images/featured/high-resolution-gfinds1akzwf6vcq.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

llm = LLM(model="llava-hf/llava-1.5-13b-hf")
sampling_params = SamplingParams(top_p=0.1, max_tokens=32)

prompt = "USER: <image>\nhey\nASSISTANT:"

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    sampling_params=sampling_params
)
print(outputs[0].outputs[0].text)

# This outputs "The image features a beautiful landscape with a large body of water, such as a lake or a river, surrounded by lush green trees and mountains. The water"
```

According to the server logs, the text inputs to the model look normal:
```
Received request chat-7832348944684bcf9d8abb7197872fab: prompt: '<s>USER: <image>\nhey\nASSISTANT:\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.1, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [1, 3148, 1001, 29901, 29871, 32000, 29871, 13, 354, 29891, 13, 22933, 9047, 13566, 29901, 13], lora_request: None, prompt_adapter_request: None
```
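One way to double-check the logged prompt is to compare it directly with the prompt passed to the offline `LLM` call. The sketch below uses only the two prompt strings quoted in this report. The helper name `prompt_difference` is hypothetical, and treating `<s>` as a BOS token that the tokenizer adds on its own in the offline path is an assumption.

```python
# Prompt strings copied verbatim from this report.
offline_prompt = "USER: <image>\nhey\nASSISTANT:"
server_prompt = "<s>USER: <image>\nhey\nASSISTANT:\n"

# Assumption: in the offline path the tokenizer prepends this BOS token itself,
# so it is ignored when comparing the two prompts.
BOS = "<s>"


def prompt_difference(offline: str, server: str, bos: str = BOS) -> dict:
    """Describe how the server-rendered prompt differs from the offline one."""
    # Drop the leading BOS marker from the logged prompt, if present.
    stripped = server[len(bos):] if server.startswith(bos) else server
    return {
        "identical_after_bos": stripped == offline,
        # Text the server prompt carries beyond the offline prompt, if the
        # offline prompt is a prefix of it; None otherwise.
        "extra_suffix": stripped[len(offline):] if stripped.startswith(offline) else None,
    }


diff = prompt_difference(offline_prompt, server_prompt)
print(diff)
```

With the strings above, the server-side prompt ends with one extra `\n` after `ASSISTANT:` (the trailing token ID 13 in the log). Whether that newline, which presumably comes from `template_llava.jinja`, is related to the repeated '\n' output has not been verified here.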

So I suspect there is a bug in how the image is processed when the server is launched via `vllm serve`. Could you please investigate?


### Before submitting a new issue...

- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
