Motivation.
Currently, models like `llava-hf/llava-next-video*` recognize image and video inputs through different tokens and perform different computations on each. Therefore, vLLM should provide new APIs and inference support for video input.
Proposed Change.
API
- `LLM.generate()` API for video:

```python
LLM.generate({
    "prompt": "<video> please summarize this video",
    "multi_modal_data": {"video": video},
})
```
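For context, here is a minimal end-to-end sketch of how this could look from user code. The model name and the frame-array format are assumptions for illustration, not part of this proposal:

```python
import numpy as np
from vllm import LLM, SamplingParams

# Hypothetical usage once video input is supported.
llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf")

# Assume the video is passed as a stack of decoded frames,
# e.g. shape (num_frames, height, width, 3); dummy data here.
video = np.random.randint(0, 255, size=(8, 336, 336, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": "<video> please summarize this video",
        "multi_modal_data": {"video": video},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```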
Roadmap
- Add `VideoPlugin` for `MultiModalPlugin`
- [model] Support for Llava-Next-Video model #7559
- Add initial support for replacing a `<video>` token with a single video.
- Add support for replacing all `<video>` and `<image>` tokens with multiple multi-modal inputs.
- Support prefix caching for the same videos.
- Support the OpenAI chat completion APIs (see the request sketch after this list).
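As a rough illustration of the last roadmap item, a chat completion request against a vLLM OpenAI-compatible server could look like the sketch below. The `video_url` content type is an assumption that mirrors the existing `image_url` convention and is not defined by this RFC:

```python
from openai import OpenAI

# Point the client at a locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/LLaVA-NeXT-Video-7B-hf",  # assumed video-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please summarize this video."},
                # "video_url" is hypothetical here, by analogy with "image_url".
                {"type": "video_url", "video_url": {"url": "https://example.com/sample.mp4"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```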
Feedback Period.
A week
CC List.
@DarkLight1337
@zifeitong
@ywang
Any Other Things.
No response