
[Feature]: Supporting MultiModal inputs using Llama3.1 #8146


Closed
daichi-m opened this issue Sep 4, 2024 · 4 comments
Labels
feature request New feature or request

Comments


daichi-m commented Sep 4, 2024

🚀 The feature, motivation and pitch

We have deployments of the Llama3.1-8B-Instruct and Llama3.1-70B-Instruct models served through vLLM on our on-premise GPU infrastructure.

While testing different use cases, we realized that the current version of vLLM does not support multimodal input for Llama3.1, as per this document: https://docs.vllm.ai/en/latest/models/supported_models.html#supported-vlms

Is it possible to enable Llama3.1 as a VLM? Or, if it can be enabled through a different route, is there any documentation or guide for it?
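For context, the request format we would like to use is roughly how image inputs are already sent to vLLM's OpenAI-compatible server for models listed as supported VLMs. The sketch below uses LLaVA purely as an example of a supported model, and the server URL and image URL are placeholders; this does not currently work for Llama 3.1:

```python
# Illustrative only: multimodal chat request against vLLM's OpenAI-compatible
# server, using a model that is already listed as a supported VLM.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",  # example of a currently supported VLM, not Llama 3.1
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```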

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
daichi-m added the feature request label on Sep 4, 2024
@DarkLight1337
Member

See #7503 (comment)

@DarkLight1337
Member

DarkLight1337 commented Sep 4, 2024

We have plans to work on this, but since Meta hasn't released the multimodal variant of Llama 3.1 on HuggingFace yet, there is no rush to complete it.

The main roadblock in the implementation is that we need to support encoder-decoder architectures for multi-modal models. So far, all of the multi-modal models in vLLM insert vision/audio features as tokens into the text token sequence before passing it to a decoder-only language model, so we can reuse much of the existing logic for language-only models. This isn't the case for Llama 3.1, which cross-attends directly to the intermediate vision representations.
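For intuition, here is a minimal PyTorch sketch (not vLLM internals; all shapes, module names, and dimensions are made up for illustration) contrasting the two integration styles: merging projected vision features into the token sequence of a decoder-only model versus cross-attending to vision representations inside the decoder layers.

```python
import torch
import torch.nn as nn

hidden, n_text, n_img = 64, 8, 4

# Style 1: "embedding merge" (what vLLM's existing VLMs do). Vision features are
# projected into the text embedding space and spliced into the token sequence,
# so the language model stays decoder-only and just sees a longer sequence.
text_emb = torch.randn(1, n_text, hidden)
vision_feats = torch.randn(1, n_img, 32)
project = nn.Linear(32, hidden)                               # vision -> text embedding space
merged = torch.cat([project(vision_feats), text_emb], dim=1)  # (1, n_img + n_text, hidden)

# Style 2: cross-attention (what the multimodal Llama variant needs). Text hidden
# states attend to vision representations inside the decoder, so the sequence
# length stays n_text but the model requires encoder-decoder style plumbing.
cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
text_hidden = torch.randn(1, n_text, hidden)
vision_kv = torch.randn(1, n_img, hidden)
attended, _ = cross_attn(query=text_hidden, key=vision_kv, value=vision_kv)  # (1, n_text, hidden)
```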

@DarkLight1337
Member

cc @ywang96

@DarkLight1337
Member

Closing as completed by #8811
