
[RFC]: Support for video input #7558

@TKONIY

Motivation.

Currently, models such as llava-hf/llava-next-video* distinguish image and video inputs with different tokens and perform different computations on each. Therefore, vLLM should provide new APIs and inference support for video input.

Proposed Change.

API
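
The user-facing API is left to the follow-up PRs. As a rough illustration only, a video could flow through the existing multi_modal_data path much like images do today; the "video" key, the model ID, the prompt format, and the frame layout in the sketch below are assumptions rather than the final design.

```python
# Hypothetical usage sketch for the proposed video input, assuming the
# existing multi_modal_data path is extended with a "video" key.
import numpy as np

from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf")

# Assume a video is passed as an array of sampled frames:
# (num_frames, height, width, 3) in RGB order.
video = np.zeros((8, 336, 336, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": "USER: <video>\nWhy is this video funny? ASSISTANT:",
        "multi_modal_data": {"video": video},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```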

Roadmap

  • Add a VideoPlugin for MultiModalPlugin (see the plugin sketch after this list).
  • [model] Support for Llava-Next-Video model #7559
    • Add initial support for replacing a <video> token with a single video.
    • Add support for replacing all <video> and <image> tokens with multiple multi-modal inputs.
  • Support prefix caching for repeated videos.
  • Support the OpenAI chat completions API (see the request sketch after this list).
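
A minimal sketch of the VideoPlugin from the first roadmap item is shown below, assuming it follows the pattern of the existing image plugin. The method names and import paths mirror the current MultiModalPlugin interface but should be read as assumptions; the actual input mapping is model-specific and belongs to #7559.

```python
# Minimal sketch of a VideoPlugin following the existing MultiModalPlugin
# pattern; method names mirror the image plugin and are assumptions here.
from vllm.multimodal.base import MultiModalInputs, MultiModalPlugin


class VideoPlugin(MultiModalPlugin):

    def get_data_key(self) -> str:
        # The key under which videos appear in multi_modal_data.
        return "video"

    def _default_input_mapper(self, ctx, data) -> MultiModalInputs:
        # Turning raw frames into model-ready tensors is model-specific
        # (e.g. via the HF video processor), so it is deferred to #7559.
        raise NotImplementedError("Per-model video mapping is added in #7559")

    def _default_max_multimodal_tokens(self, ctx) -> int:
        # Upper bound on the number of tokens one video expands to; depends
        # on the model's frame count and patch configuration.
        raise NotImplementedError
```

The plugin would then be registered with the multi-modal registry so that the "video" key is handled alongside "image".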
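
For the last roadmap item, one possible shape of an OpenAI-compatible chat request carrying a video is sketched below. The "video_url" content part, the model ID, and the server URL are purely illustrative; the RFC does not fix the wire format.

```python
# Hypothetical OpenAI-compatible chat request with a video attachment.
# The "video_url" content part is an assumed extension for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/LLaVA-NeXT-Video-7B-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video_url",
             "video_url": {"url": "https://example.com/clip.mp4"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```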

Feedback Period.

A week

CC List.

@DarkLight1337
@zifeitong
@ywang

Any Other Things.

No response
