
[RFC]: Support for video input #7558

@TKONIY

Motivation.

Currently, models such as llava-hf/llava-next-video* distinguish image and video inputs with different tokens and perform different computations on each. Therefore, vLLM should provide new APIs and inference support for video input.

Proposed Change.

API
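
The user-facing API is left to the follow-up PRs. As a rough illustration only, a video could flow through the existing multi_modal_data path much like images do today; the "video" key, the model ID, the prompt format, and the frame layout in the sketch below are assumptions rather than the final design.

```python
# Hypothetical usage sketch for the proposed video input, assuming the
# existing multi_modal_data path is extended with a "video" key.
import numpy as np

from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf")

# Assume a video is passed as an array of sampled frames:
# (num_frames, height, width, 3) in RGB order.
video = np.zeros((8, 336, 336, 3), dtype=np.uint8)

outputs = llm.generate(
    {
        "prompt": "USER: <video>\nWhy is this video funny? ASSISTANT:",
        "multi_modal_data": {"video": video},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```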

Roadmap

  • Add a VideoPlugin for MultiModalPlugin (see the plugin sketch after this list).
  • [model] Support for Llava-Next-Video model #7559
    • Add initial support for replacing a <video> token with a single video.
    • Add support for replacing all <video> and <image> tokens with multiple multi-modal inputs.
  • Support prefix caching for repeated videos.
  • Support the OpenAI chat completions API (see the request sketch after this list).
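
A minimal sketch of the VideoPlugin from the first roadmap item is shown below, assuming it follows the pattern of the existing image plugin. The method names and import paths mirror the current MultiModalPlugin interface but should be read as assumptions; the actual input mapping is model-specific and belongs to #7559.

```python
# Minimal sketch of a VideoPlugin following the existing MultiModalPlugin
# pattern; method names mirror the image plugin and are assumptions here.
from vllm.multimodal.base import MultiModalInputs, MultiModalPlugin


class VideoPlugin(MultiModalPlugin):

    def get_data_key(self) -> str:
        # The key under which videos appear in multi_modal_data.
        return "video"

    def _default_input_mapper(self, ctx, data) -> MultiModalInputs:
        # Turning raw frames into model-ready tensors is model-specific
        # (e.g. via the HF video processor), so it is deferred to #7559.
        raise NotImplementedError("Per-model video mapping is added in #7559")

    def _default_max_multimodal_tokens(self, ctx) -> int:
        # Upper bound on the number of tokens one video expands to; depends
        # on the model's frame count and patch configuration.
        raise NotImplementedError
```

The plugin would then be registered with the multi-modal registry so that the "video" key is handled alongside "image".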
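
For the last roadmap item, one possible shape of an OpenAI-compatible chat request carrying a video is sketched below. The "video_url" content part, the model ID, and the server URL are purely illustrative; the RFC does not fix the wire format.

```python
# Hypothetical OpenAI-compatible chat request with a video attachment.
# The "video_url" content part is an assumed extension for illustration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/LLaVA-NeXT-Video-7B-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video_url",
             "video_url": {"url": "https://example.com/clip.mp4"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```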

Feedback Period.

A week

CC List.

@DarkLight1337
@zifeitong
@ywang

Any Other Things.

No response
