
Feature Request: Support multimodal LLMs such as Qwen2.5-VL as embedding models #13247


Open · cebtenzzre opened this issue May 1, 2025 · 3 comments
Labels: enhancement (New feature or request)

Comments

@cebtenzzre
Collaborator

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

llama.cpp should support multimodal models built upon architectures such as Qwen2.5-VL for image and text embeddings.

Motivation

Multimodal LLMs demonstrate better alignment between image and text embeddings than contrastively trained models such as CLIP, which suffer from a modality gap (a text embedding often compares better with unrelated text than it does with a related image).

Nomic's latest vision models are designed for PDF document retrieval. nomic-embed-multimodal-3b, which generates a single embedding per rasterized PDF page, is already supported by vLLM as it is compatible with the Qwen2-VL embedding model tested here. It is not yet supported by llama.cpp.

Possible Implementation

This would build upon #13209, which adds vision support for Qwen2.5-VL. Also relevant is #12898, which brings vision to the llama.cpp server and would make the embeddings useful in practice, since you can't do much with a single embedding generated via llama-embedding or similar.

cebtenzzre added the enhancement label on May 1, 2025
@ngxson
Collaborator

ngxson commented May 2, 2025

Btw, does nomic-embed-multimodal-3b use a causal mask?

If that is the case, then it will be very simple: we just need one call to mtmd_eval and then get the embeddings.
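A rough sketch of that flow is shown below; the mtmd_* names and signatures are assumptions based on the in-tree mtmd helpers rather than a confirmed API, and `model`, `lctx`, `prompt`, and `bitmap` are assumed to be set up elsewhere with the llama_context configured for embedding output:

```cpp
// Sketch only: the mtmd_* calls are assumed from the in-tree mtmd helpers
// and their exact signatures may differ.
mtmd_context_params mparams = mtmd_context_params_default();
mtmd_context * mctx = mtmd_init_from_file("mmproj.gguf", model, mparams);

// tokenize the prompt (which contains the image marker) together with the
// rasterized page, producing a list of text/image chunks
mtmd_input_chunks chunks;
mtmd_tokenize(mctx, chunks, prompt, { bitmap });

// one eval over all chunks (text tokens + image embeddings) ...
mtmd_helper_eval(mctx, lctx, chunks, /*pos0=*/0, /*seq_id=*/0, /*n_batch=*/512);

// ... after which the pooled embedding for the sequence can be read
const float * emb = llama_get_embeddings_seq(lctx, 0);
```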

@cebtenzzre
Collaborator Author

> Btw, does nomic-embed-multimodal-3b use a causal mask?

Yes, it uses causal attention. The only required change compared to a standard VLM is to do pooling on the final hidden states (using the last token by default). Also, it is important (but not required) to be able to process a batch of differently sized images at once and get a batch of embeddings, for optimal speed.
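For reference, a minimal sketch of the pooling side for the text-only path, using the existing llama.cpp C API with last-token pooling (LLAMA_POOLING_TYPE_LAST); the GGUF path and prompt are placeholders, and with vision support the image chunks would be evaluated into the same sequence before the pooled read:

```cpp
// Minimal sketch: configure a llama.cpp context for last-token pooling and
// read one pooled, L2-normalized embedding for sequence 0.
#include "llama.h"

#include <cmath>
#include <cstring>
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings   = true;                     // output embeddings instead of logits
    cparams.pooling_type = LLAMA_POOLING_TYPE_LAST;  // pool on the final token's hidden state
    llama_context * ctx = llama_init_from_model(model, cparams);

    // tokenize the text part of the prompt (placeholder text)
    const llama_vocab * vocab  = llama_model_get_vocab(model);
    const char *        prompt = "a rasterized PDF page would normally go here";
    std::vector<llama_token> tokens(1024);
    int n_tok = llama_tokenize(vocab, prompt, (int32_t) strlen(prompt),
                               tokens.data(), (int32_t) tokens.size(),
                               /*add_special=*/true, /*parse_special=*/true);
    tokens.resize(n_tok);

    // single decode; with pooling enabled, the pooled hidden state per
    // sequence is kept by the backend
    llama_decode(ctx, llama_batch_get_one(tokens.data(), (int32_t) tokens.size()));

    // read the pooled (last-token) embedding and L2-normalize it
    const int     n_embd = llama_model_n_embd(model);
    const float * emb    = llama_get_embeddings_seq(ctx, 0);
    std::vector<float> out(emb, emb + n_embd);
    float norm = 0.0f;
    for (float v : out) norm += v * v;
    norm = std::sqrt(norm);
    for (float & v : out) v /= norm;  // `out` is the final unit-length embedding

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```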

@ngxson
Collaborator

ngxson commented May 3, 2025

> Also, it is important (but not required) to be able to process a batch of differently sized images at once and get a batch of embeddings, for optimal speed.

I doubt that this is possible in the short term, mainly because:

  • clip.cpp does not support batch encoding at the moment, since it may use too much memory
  • llama_batch does not allow mixing text tokens and embedded image input in the same batch, which is needed if you have begin/end-of-image tokens. We already plan to support this case, but it is still quite complicated
