
Commit f00446b

DarkLight1337 authored and Akshat-Tripathi committed
[Doc] Move multimodal Embedding API example to Online Serving page (vllm-project#14017)
Signed-off-by: DarkLight1337 <[email protected]>
1 parent 0e21ae3 commit f00446b

3 files changed: +89 -84 lines changed


docs/source/serving/multimodal_inputs.md

Lines changed: 9 additions & 80 deletions
@@ -16,7 +16,7 @@ To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`
 - `prompt`: The prompt should follow the format that is documented on HuggingFace.
 - `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.inputs.MultiModalDataDict`.

-### Image
+### Image Inputs

 You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:
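For reference (not part of this commit), a minimal offline sketch of the `'image'` field described in this hunk; the model name, prompt format, and image path are illustrative assumptions rather than anything the diff prescribes:

```python
# Sketch: single-image generation via `multi_modal_data` (model/prompt/path assumed).
from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

image = Image.open("example.jpg")  # any local RGB image

outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is shown in this image?\nASSISTANT:",
    "multi_modal_data": {"image": image},
})
for o in outputs:
    print(o.outputs[0].text)
```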

@@ -120,20 +120,20 @@ for o in outputs:
     print(generated_text)
 ```

-### Video
+### Video Inputs

 You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
 instead of using multi-image input.

 Full example: <gh-file:examples/offline_inference/vision_language.py>

-### Audio
+### Audio Inputs

 You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.

 Full example: <gh-file:examples/offline_inference/audio_language.py>

-### Embedding
+### Embedding Inputs

 To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
 pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
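The last two context lines above give the shape rule for pre-computed embeddings. A minimal sketch (not part of this commit) of that rule for a single image, assuming a LLaVA-1.5-style model where 576 patch features of hidden size 4096 fit the stated shape; verify both numbers against your own model:

```python
# Sketch: pre-computed image embeddings passed to the 'image' field.
# Random features stand in for real encoder output, just to show the plumbing.
import torch
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# (num_items, feature_size, hidden_size of LM) -- 1 image, 576 patches, 4096 dims (assumed).
image_embeds = torch.rand(1, 576, 4096, dtype=torch.float16)

outputs = llm.generate({
    "prompt": "USER: <image>\nDescribe the image.\nASSISTANT:",
    "multi_modal_data": {"image": image_embeds},
})
print(outputs[0].outputs[0].text)
```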
@@ -211,7 +211,7 @@ The chat template can be inferred based on the documentation on the model's Hugg
 For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
 :::

-### Image
+### Image Inputs

 Image input is supported according to [OpenAI Vision API](https://platform.openai.com/docs/guides/vision).
 Here is a simple example using Phi-3.5-Vision.
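A minimal client-side sketch (not part of this commit) of the OpenAI Vision API pattern referenced in this hunk, pointed at a local vLLM server; the server address, API key, image URL, and exact model name are assumptions:

```python
# Sketch: chat completion with an image URL against a vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_response = client.chat.completions.create(
    model="microsoft/Phi-3.5-vision-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/duck.jpg"}},
        ],
    }],
)
print(chat_response.choices[0].message.content)
```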
@@ -293,7 +293,7 @@ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

 :::

-### Video
+### Video Inputs

 Instead of `image_url`, you can pass a video file via `video_url`. Here is a simple example using [LLaVA-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf).
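A minimal sketch (not part of this commit) of the `video_url` variant described in this hunk; the video URL, model name, and server address are assumptions:

```python
# Sketch: same chat-completions call, with a `video_url` content part instead of `image_url`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_response = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this video."},
            {"type": "video_url", "video_url": {"url": "https://example.com/clip.mp4"}},
        ],
    }],
)
print(chat_response.choices[0].message.content)
```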

@@ -356,7 +356,7 @@ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>

 :::

-### Audio
+### Audio Inputs

 Audio input is supported according to [OpenAI Audio API](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in).
 Here is a simple example using Ultravox-v0.5-1B.
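A minimal sketch (not part of this commit) of sending audio as base64 `input_audio`, following the OpenAI Audio API shape referenced in this hunk; the file name, model identifier, and server address are assumptions:

```python
# Sketch: base64-encoded audio sent as an `input_audio` content part.
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("speech.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

chat_response = client.chat.completions.create(
    model="fixie-ai/ultravox-v0_5-llama-3_2-1b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is being said in this recording?"},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(chat_response.choices[0].message.content)
```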
@@ -460,77 +460,6 @@ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>

 :::

-### Embedding
+### Embedding Inputs

-vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings),
-where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models.
-
-:::{tip}
-The schema of `messages` is exactly the same as in Chat Completions API.
-You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
-:::
-
-Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
-Refer to the examples below for illustration.
-
-Here is an end-to-end example using VLM2Vec. To serve the model:
-
-```bash
-vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
-    --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
-```
-
-:::{important}
-Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
-to run this model in embedding mode instead of text generation mode.
-
-The custom chat template is completely different from the original one for this model,
-and can be found here: <gh-file:examples/template_vlm2vec.jinja>
-:::
-
-Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
-
-```python
-import requests
-
-image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
-
-response = requests.post(
-    "http://localhost:8000/v1/embeddings",
-    json={
-        "model": "TIGER-Lab/VLM2Vec-Full",
-        "messages": [{
-            "role": "user",
-            "content": [
-                {"type": "image_url", "image_url": {"url": image_url}},
-                {"type": "text", "text": "Represent the given image."},
-            ],
-        }],
-        "encoding_format": "float",
-    },
-)
-response.raise_for_status()
-response_json = response.json()
-print("Embedding output:", response_json["data"][0]["embedding"])
-```
-
-Below is another example, this time using the `MrLight/dse-qwen2-2b-mrl-v1` model.
-
-```bash
-vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
-    --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
-```
-
-:::{important}
-Like with VLM2Vec, we have to explicitly pass `--task embed`.
-
-Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
-by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
-:::
-
-:::{important}
-Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
-example below for details.
-:::
-
-Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>
+TBD

docs/source/serving/openai_compatible_server.md

Lines changed: 77 additions & 3 deletions
@@ -266,11 +266,85 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
 If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
 which will be treated as a single prompt to the model.

-:::{tip}
-This enables multi-modal inputs to be passed to embedding models, see [this page](#multimodal-inputs) for details.
+Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
+
+#### Multi-modal inputs
+
+You can pass multi-modal inputs to embedding models by defining a custom chat template for the server
+and passing a list of `messages` in the request. Refer to the examples below for illustration.
+
+:::::{tab-set}
+::::{tab-item} VLM2Vec
+
+To serve the model:
+
+```bash
+vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
+    --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
+```
+
+:::{important}
+Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
+to run this model in embedding mode instead of text generation mode.
+
+The custom chat template is completely different from the original one for this model,
+and can be found here: <gh-file:examples/template_vlm2vec.jinja>
 :::

-Code example: <gh-file:examples/online_serving/openai_embedding_client.py>
+Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
+
+```python
+import requests
+
+image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
+
+response = requests.post(
+    "http://localhost:8000/v1/embeddings",
+    json={
+        "model": "TIGER-Lab/VLM2Vec-Full",
+        "messages": [{
+            "role": "user",
+            "content": [
+                {"type": "image_url", "image_url": {"url": image_url}},
+                {"type": "text", "text": "Represent the given image."},
+            ],
+        }],
+        "encoding_format": "float",
+    },
+)
+response.raise_for_status()
+response_json = response.json()
+print("Embedding output:", response_json["data"][0]["embedding"])
+```
+
+::::
+
+::::{tab-item} DSE-Qwen2-MRL
+
+To serve the model:
+
+```bash
+vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
+    --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
+```
+
+:::{important}
+Like with VLM2Vec, we have to explicitly pass `--task embed`.
+
+Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
+by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
+:::
+
+:::{important}
+`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
+example below for details.
+:::
+
+::::
+
+:::::
+
+Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>

 #### Extra parameters
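As a companion to the placeholder-image note added in this hunk, a minimal sketch (not part of this commit) of embedding a text-only query with DSE-Qwen2-MRL; the 28x28 placeholder size, the query wording, and the server address are assumptions, and the full example referenced in the diff remains the authoritative version:

```python
# Sketch: text-only query embedding with a tiny dummy image attached (size assumed).
import base64
import io

import requests
from PIL import Image

# Encode a blank placeholder image as a base64 data URL.
buffer = io.BytesIO()
Image.new("RGB", (28, 28)).save(buffer, format="PNG")
image_data_url = ("data:image/png;base64," +
                  base64.b64encode(buffer.getvalue()).decode("utf-8"))

response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "MrLight/dse-qwen2-2b-mrl-v1",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_data_url}},
                {"type": "text", "text": "Query: What is the capital of France?"},
            ],
        }],
        "encoding_format": "float",
    },
)
response.raise_for_status()
print(len(response.json()["data"][0]["embedding"]), "dimensions")
```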

vllm/model_executor/models/registry.py

Lines changed: 3 additions & 1 deletion
@@ -19,6 +19,7 @@
 import torch.nn as nn

 from vllm.logger import init_logger
+from vllm.utils import is_in_doc_build

 from .interfaces import (has_inner_state, is_attention_free, is_hybrid,
                          supports_cross_encoding, supports_multimodal,
@@ -368,7 +369,8 @@ def register_model(
                 raise ValueError(msg)

             model = _LazyRegisteredModel(*split_str)
-        elif isinstance(model_cls, type) and issubclass(model_cls, nn.Module):
+        elif isinstance(model_cls, type) and (is_in_doc_build() or issubclass(
+                model_cls, nn.Module)):
             model = _RegisteredModel.from_model_cls(model_cls)
         else:
             msg = ("`model_cls` should be a string or PyTorch model class, "
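The condition change above relaxes the `issubclass` check while documentation is being built, presumably so that model registration does not fail when classes are mocked during the Sphinx build. A generic, hypothetical sketch of the guard pattern; the environment variables shown are assumptions, and vLLM's actual `is_in_doc_build` may look different:

```python
# Hypothetical sketch of a doc-build guard; not vLLM's real implementation.
import os

import torch.nn as nn


def is_in_doc_build() -> bool:
    # Example signals only -- which variables are checked is an assumption.
    return bool(os.environ.get("READTHEDOCS") or os.environ.get("SPHINX_BUILD"))


def can_register(model_cls: object) -> bool:
    # Mirrors the relaxed check in the diff: during a doc build, accept any class
    # without requiring it to be a real nn.Module subclass.
    return isinstance(model_cls, type) and (is_in_doc_build()
                                            or issubclass(model_cls, nn.Module))
```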
