
Support multimodal models with vLLM #3670

Closed
@mudler

Description


Is your feature request related to a problem? Please describe.
Many models are now becoming multimodal, that is, they can accept images, videos, or audio during inference. The llama.cpp project currently provides multimodal support, and we do as well by building on it; however, some models are not yet supported there (for instance #3535 and #3669, see also ggml-org/llama.cpp#9455).

Describe the solution you'd like
LocalAI should support vLLM's multimodal capabilities through its vLLM backend.

Describe alternatives you've considered

Additional context
See #3535 and #3669; tangentially related to #2318 and #3602.

https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_pixtral.py

https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py
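
For reference, the linked vLLM examples reduce to a small offline-inference pattern: the image is passed alongside the prompt via `multi_modal_data`. Below is a minimal sketch of that API; the model name, prompt template, and image path are illustrative placeholders, not part of any LocalAI implementation.

```python
# Minimal sketch of vLLM offline multimodal inference, following the
# linked vLLM examples. Model, prompt template, and image path are
# illustrative placeholders.
from vllm import LLM, SamplingParams
from PIL import Image

# Load a vision-language model; vLLM wires up its image encoder internally.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# The prompt must contain the model's own image placeholder token.
prompt = "USER: <image>\nWhat is shown in this image?\nASSISTANT:"
image = Image.open("example.jpg")

# The image is passed next to the text prompt via multi_modal_data.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.2, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

If the sketch above holds, the LocalAI side would mostly be a matter of decoding the image payloads from incoming API requests and forwarding them to the vLLM backend in that field.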
