@@ -16,7 +16,7 @@ To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`
- `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.inputs.MultiModalDataDict`.

- ### Image
+ ### Image Inputs

You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
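As a rough sketch of that pattern, an offline-inference call bundles the prompt and the image into a single dictionary; the model name, prompt format, and image path below are illustrative assumptions rather than requirements of the API:

```python
from PIL import Image
from vllm import LLM

# Illustrative vision-language model; any supported VLM works the same way.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# The prompt format (including the image placeholder) is model-specific;
# check the model card on HuggingFace.
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

image = Image.open("example.jpg")  # assumed local image file

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})

for o in outputs:
    print(o.outputs[0].text)
```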

@@ -120,20 +120,20 @@ for o in outputs:
    print(generated_text)
```

- ### Video
+ ### Video Inputs

You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
instead of using multi-image input.
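A minimal sketch of that, assuming an illustrative video-capable model, a model-specific `<video>` placeholder, and randomly generated frames standing in for a decoded clip:

```python
import numpy as np
from vllm import LLM

# Illustrative video-capable model.
llm = LLM(model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# A clip is represented as an array of RGB frames; here 16 random frames
# stand in for frames decoded from a real video (e.g. with OpenCV or decord).
video = np.random.randint(0, 256, size=(16, 336, 336, 3), dtype=np.uint8)

# The video placeholder and chat markup are model-specific; this string is illustrative.
prompt = "<|im_start|>user <video>\nWhat is happening in this video?<|im_end|>\n<|im_start|>assistant\n"

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"video": video},
})

for o in outputs:
    print(o.outputs[0].text)
```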

Full example: <gh-file:examples/offline_inference/vision_language.py>

- ### Audio
+ ### Audio Inputs

You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.
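A minimal sketch of that, assuming an Ultravox-style model, `librosa` for loading, and an assumed `<|audio|>` placeholder (the exact placeholder and chat markup are model-specific):

```python
import librosa
from transformers import AutoTokenizer
from vllm import LLM

# Illustrative audio-capable model (Ultravox-v0.5-1B).
model_name = "fixie-ai/ultravox-v0_5-llama-3_2-1b"
llm = LLM(model=model_name, max_model_len=4096, trust_remote_code=True)

# librosa.load returns exactly the (array, sampling_rate) tuple expected here.
audio, sampling_rate = librosa.load("speech.wav", sr=None)  # assumed local file

# Build the prompt with the model's own chat template; '<|audio|>' is an
# assumed placeholder token -- consult the model card for the real one.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
messages = [{"role": "user", "content": "<|audio|>\nWhat is the speaker saying?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"audio": (audio, sampling_rate)},
})

for o in outputs:
    print(o.outputs[0].text)
```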

Full example: <gh-file:examples/offline_inference/audio_language.py>

- ### Embedding
+ ### Embedding Inputs

To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
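A minimal sketch for image embeddings, where the model name and the `(1, 576, 4096)` shape are illustrative values for a LLaVA-1.5-style model rather than fixed requirements:

```python
import torch
from vllm import LLM

# Illustrative vision-language model.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Pre-computed image features: (num_items, feature_size, hidden_size of the LM).
# In practice these would come from the model's own vision encoder and projector,
# not random values.
image_embeds = torch.rand(1, 576, 4096)

outputs = llm.generate({
    "prompt": "USER: <image>\nDescribe this image.\nASSISTANT:",
    "multi_modal_data": {"image": image_embeds},
})

for o in outputs:
    print(o.outputs[0].text)
```

Whether embedding inputs are accepted, and whether extra per-item parameters are needed, depends on the specific model implementation.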
@@ -211,7 +211,7 @@ The chat template can be inferred based on the documentation on the model's Hugg
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
:::

- ### Image
+ ### Image Inputs

Image input is supported according to [OpenAI Vision API](https://platform.openai.com/docs/guides/vision).
Here is a simple example using Phi-3.5-Vision.
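For reference, a request of that shape sent with the official OpenAI Python client might look like the following; the base URL and serve command are assumptions for a typical local deployment:

```python
from openai import OpenAI

# Assumes the model is already being served locally, e.g.:
#   vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 4096
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

chat_completion = client.chat.completions.create(
    model="microsoft/Phi-3.5-vision-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
    max_tokens=64,
)
print(chat_completion.choices[0].message.content)
```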
@@ -293,7 +293,7 @@ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
:::

- ### Video
+ ### Video Inputs

Instead of `image_url`, you can pass a video file via `video_url`. Here is a simple example using [LLaVA-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf).
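A sketch of such a request, with a placeholder video URL and the model above assumed to be served locally:

```python
from openai import OpenAI

# Assumes the LLaVA-OneVision model is being served locally with vllm serve.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

video_url = "https://example.com/sample_video.mp4"  # placeholder URL

chat_completion = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video."},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }],
    max_tokens=64,
)
print(chat_completion.choices[0].message.content)
```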
@@ -356,7 +356,7 @@ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
:::

- ### Audio
+ ### Audio Inputs

Audio input is supported according to [OpenAI Audio API](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in).
Here is a simple example using Ultravox-v0.5-1B.
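A sketch of an audio-in request, assuming a local server running the Ultravox model and a local WAV file that is base64-encoded per the OpenAI `input_audio` format:

```python
import base64

from openai import OpenAI

# Assumes an Ultravox-style audio model is being served locally with vllm serve.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Read a local WAV file (assumed path) and base64-encode it, as required by
# the OpenAI audio-in request format.
with open("speech.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

chat_completion = client.chat.completions.create(
    model="fixie-ai/ultravox-v0_5-llama-3_2-1b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is being said in this audio?"},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
    max_tokens=64,
)
print(chat_completion.choices[0].message.content)
```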
@@ -460,77 +460,6 @@ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
:::

- ### Embedding
+ ### Embedding Inputs

- vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings),
- where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models.
-
- :::{tip}
- The schema of `messages` is exactly the same as in Chat Completions API.
- You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
- :::
-
- Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
- Refer to the examples below for illustration.
-
- Here is an end-to-end example using VLM2Vec. To serve the model:
-
- ```bash
- vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
-   --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
- ```
-
- :::{important}
- Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
- to run this model in embedding mode instead of text generation mode.
-
- The custom chat template is completely different from the original one for this model,
- and can be found here: <gh-file:examples/template_vlm2vec.jinja>
- :::
-
- Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
-
- ```python
- import requests
-
- image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
-
- response = requests.post(
-     "http://localhost:8000/v1/embeddings",
-     json={
-         "model": "TIGER-Lab/VLM2Vec-Full",
-         "messages": [{
-             "role": "user",
-             "content": [
-                 {"type": "image_url", "image_url": {"url": image_url}},
-                 {"type": "text", "text": "Represent the given image."},
-             ],
-         }],
-         "encoding_format": "float",
-     },
- )
- response.raise_for_status()
- response_json = response.json()
- print("Embedding output:", response_json["data"][0]["embedding"])
- ```
-
- Below is another example, this time using the `MrLight/dse-qwen2-2b-mrl-v1` model.
-
- ```bash
- vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
-   --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
- ```
-
- :::{important}
- Like with VLM2Vec, we have to explicitly pass `--task embed`.
-
- Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
- by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
- :::
-
- :::{important}
- Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
- example below for details.
- :::
-
- Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>
+ TBD