Description
Is your feature request related to a problem? Please describe.
Many models are now becoming multimodal, that is, they can accept images, videos, or audio during inference. The llama.cpp project currently provides multimodal support, and LocalAI does as well by using it; however, there are models which aren't supported yet (for instance #3535 and #3669, see also ggml-org/llama.cpp#9455).
Describe the solution you'd like
LocalAI should support vLLM's multimodal capabilities, so that vLLM-backed models can accept images (and other media) through LocalAI's existing OpenAI-compatible API.
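As a rough sketch of what this would enable on the client side, assuming LocalAI's existing OpenAI-compatible endpoint on port 8080 (the model name and image URL below are placeholders, not real configuration):

```python
# Hypothetical client call against a LocalAI instance that routes a
# multimodal request to a vLLM-backed model. "pixtral-12b" is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="pixtral-12b",  # placeholder name for a vLLM-backed multimodal model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```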
Describe alternatives you've considered
Additional context
See #3535 and #3669; tangentially related to #2318 and #3602.
https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_pixtral.py
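For reference, the vLLM example linked above follows roughly this pattern (a minimal sketch modeled on that example; the model name and exact API may differ across vLLM versions):

```python
# Offline multimodal inference with vLLM, modeled on the linked pixtral
# example. Model name and API details may vary between vLLM versions.
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }
]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```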