[Bugfix][Multi Modal] Fix broken frames in video input #25881

Jixin10 · 2025-09-29T12:26:49Z

Purpose

Fix #20313

Fix the internal server error when the video input has broken frames.

The engine gets the frame count via total_frames_num = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)), but some of those frames are broken and cannot be read via cap.grab() and cap.retrieve(). As a result, the number of loaded frames i does not equal num_frames, and the failed assertion surfaces as an internal server error.
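
The mismatch can be illustrated without a real video. The sketch below is a minimal stand-in, not the vLLM code: FakeCapture is a hypothetical class that mimics only the grab()/retrieve() calls of cv2.VideoCapture, and read_sampled_frames mirrors the shape of the pre-fix loop shown later in the diff.

```python
class FakeCapture:
    """Hypothetical stand-in for cv2.VideoCapture; only grab()/retrieve() are mimicked."""

    def __init__(self, reported_frames, broken):
        self.reported_frames = reported_frames  # what CAP_PROP_FRAME_COUNT would claim
        self.broken = set(broken)               # positions where grab() fails
        self.pos = -1

    def grab(self):
        self.pos += 1
        return self.pos < self.reported_frames and self.pos not in self.broken

    def retrieve(self):
        return True, f"frame-{self.pos}"


def read_sampled_frames(cap, total_frames_num, frame_idx):
    """Mirror of the pre-fix loop: stop at the first failed grab()."""
    wanted = set(frame_idx)
    loaded = []
    for idx in range(total_frames_num):
        if not cap.grab():
            break
        if idx in wanted:
            ok, frame = cap.retrieve()
            if ok:
                loaded.append(frame)
    return loaded


# Metadata reports 10 frames, but the last one cannot be grabbed.
cap = FakeCapture(reported_frames=10, broken={9})
frame_idx = [0, 3, 6, 9]
loaded = read_sampled_frames(cap, 10, frame_idx)
print(len(frame_idx), len(loaded))  # 4 requested, 3 loaded
```

With four indices requested and only three frames loaded, a check of the form i == num_frames fails in the same way as the error reported below.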

Test Plan

curl -X POST localhost:50799/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model":"qwen2.5_vl",
  "messages": [
      {
        "role": "user",
        "content": [
          {
            "video_url": {
              "url": "{video_url}"
            },
            "type": "video_url"
          },
          {
            "type": "text",
            "text": "Please tell me what the video describes"
          }
        ]
      }
  ],
  "stream": false,
  "max_tokens": 1000
}'
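
For scripted runs, the same request can be assembled in Python. This is only a sketch: build_request is a hypothetical helper, the port and model name are copied from the curl command above, and the video URL is a placeholder to fill in.

```python
import json
import urllib.request


def build_request(video_url, base_url="http://localhost:50799"):
    """Build (but do not send) the chat completion request from the test plan."""
    payload = {
        "model": "qwen2.5_vl",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": "Please tell me what the video describes"},
                ],
            }
        ],
        "stream": False,
        "max_tokens": 1000,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("https://example.com/video.mp4")  # placeholder URL
# urllib.request.urlopen(req) would send it to a running server.
```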

Test Result

Before fix:

AssertionError: Expected reading 32 frames, but only loaded 31 frames from video.
INFO:     127.0.0.1:26426 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

After fix:

{"id":"chatcmpl-9246e6c97b934cbb829897080ccc3822","object":"chat.completion","created":1759148580,"model":"qwen2.5_vl","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The video shows xxx","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":6281,"total_tokens":6469,"completion_tokens":188,"prompt_tokens_details":null},"prompt_logprobs":null}

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: 凌葭 <[email protected]>

github-actions · 2025-09-29T12:26:57Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors.

You ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

DarkLight1337 · 2025-09-29T12:27:59Z

Thanks, can you add a test to prevent regressions?

gemini-code-assist

Code Review

This pull request aims to fix an internal server error caused by broken frames in video inputs, which leads to a mismatch between the expected and actual number of frames. While the approach of pre-validating frames by iterating with cap.grab() is a good step towards robustness, the current implementation introduces several critical issues in the frame reading logic. The new loop for reading frames is not robust against grab or retrieve failures, does not handle cases with duplicated frame indices correctly, and can raise an IndexError for videos with no valid frames. Additionally, the logic for determining a full read is flawed. These issues could lead to crashes or incorrect behavior. I've provided detailed comments and suggestions for fixing these problems. A similar issue likely exists in OpenCVDynamicVideoBackend which should also be addressed.

gemini-code-assist · 2025-09-29T12:29:58Z

vllm/multimodal/video.py

        i = 0
-        for idx in range(total_frames_num):
-            ok = cap.grab()
-            if not ok:
-                break
-            if idx in frame_idx:
-                ret, frame = cap.retrieve()
-                if ret:
-                    frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
-                    i += 1
+        cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
+        validate_list_idx = frame_idx[i]
+        target_frame_pos = validate_frames_list[validate_list_idx]
+        for pos in range(total_frames_num):
+            cap.grab()
+            if target_frame_pos != pos:
+                continue
+            ret, frame = cap.retrieve()
+            if ret:
+                frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+                i += 1
+                if i >= len(frame_idx):
+                    break
+                validate_list_idx = frame_idx[i]
+                target_frame_pos = validate_frames_list[validate_list_idx]


The new frame reading loop has several critical issues that make it fragile:

IndexError: If frame_idx is empty (e.g., a video with no valid frames), frame_idx[i] at line 150 will raise an IndexError.

Unhandled grab() failure: The cap.grab() call on line 153 is not checked. If it fails, the loop continues, which can lead to incorrect behavior and an eventual assertion failure.

Incorrect retrieve() failure handling: If cap.retrieve() fails (ret is false), i is not incremented and target_frame_pos is not updated. The loop continues, but pos will never match target_frame_pos again, causing the final assertion to fail.

Duplicate frames bug: If frame_idx contains duplicate indices (which can happen with np.linspace when upsampling), this logic will fail to read the duplicated frames. After reading a frame at a certain position, pos increments, and the same target_frame_pos for the duplicate entry will not be matched again.

A more robust implementation is needed to handle these cases correctly. Here is a suggested replacement that addresses these issues by using a while loop and properly managing state.

Additionally, OpenCVDynamicVideoBackend appears to use a similar, older frame reading loop which is also susceptible to some of these issues and should likely be updated as well.

i = 0
pos = 0
cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
while i < len(frame_idx) and pos < total_frames_num:
    # The frame position in the original video that we want to read.
    target_pos = validate_frames_list[frame_idx[i]]

    if pos < target_pos:
        # Seek forward by grabbing frames.
        if not cap.grab():
            break  # End of stream.
        pos += 1
        continue

    # At this point, pos should be equal to target_pos.
    # Grab the frame at the current position `pos`.
    if not cap.grab():
        break  # End of stream.

    ret, frame = cap.retrieve()
    if ret:
        rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        # This frame might be requested multiple times (upsampling).
        # Fill all duplicates for this position.
        while i < len(frame_idx) and validate_frames_list[frame_idx[i]] == target_pos:
            frames[i] = rgb_frame
            i += 1
    else:
        # If retrieve fails, we cannot read this frame.
        # Skip all requests for this frame to avoid getting stuck.
        while i < len(frame_idx) and validate_frames_list[frame_idx[i]] == target_pos:
            i += 1

    pos += 1
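
The duplicate-index case is easy to reproduce. The sketch below emulates np.linspace(0, total - 1, num, dtype=int) in pure Python and groups output slots by source position, which is the bookkeeping needed to fill all duplicates of one frame at once. Both helpers are hypothetical, and the assumption that the backend samples indices this way comes from the review comment above.

```python
from collections import defaultdict


def sample_indices(total, num):
    """Evenly spaced integer indices in [0, total - 1], truncated toward zero."""
    if num == 1:
        return [0]
    return [int(k * (total - 1) / (num - 1)) for k in range(num)]


def slots_by_position(frame_idx):
    """Map each source frame position to every output slot that requests it."""
    slots = defaultdict(list)
    for slot, pos in enumerate(frame_idx):
        slots[pos].append(slot)
    return dict(slots)


# Upsampling: 8 output frames requested from a 5-frame video.
frame_idx = sample_indices(total=5, num=8)
print(frame_idx)                     # [0, 0, 1, 1, 2, 2, 3, 4]
print(slots_by_position(frame_idx))  # positions 0, 1 and 2 each feed two slots
```

A loop that advances to the next position after a single match would leave the second slot of positions 0, 1 and 2 unfilled.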

gemini-code-assist · 2025-09-29T12:29:58Z

vllm/multimodal/video.py

+        validate_total_frames_num = len(validate_frames_list)
+
        # resample video to target num_frames
        full_read = num_frames == -1 or total_frames_num < num_frames


The condition for full_read uses total_frames_num, which is the frame count from the video metadata and can be incorrect. This can lead to incorrect sampling behavior when the number of valid frames is less than the requested num_frames but total_frames_num is greater. You should use validate_total_frames_num here to ensure the decision is based on the actual number of readable frames.

Suggested change

-        full_read = num_frames == -1 or total_frames_num < num_frames
+        full_read = num_frames == -1 or validate_total_frames_num < num_frames
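
A small example makes the difference concrete. The function below is a hypothetical reduction of the condition to a pure function, and the numbers are made up for illustration.

```python
def is_full_read(num_frames, available_frames):
    """True when every available frame should be read instead of resampling."""
    return num_frames == -1 or available_frames < num_frames


total_frames_num = 40           # count reported by the container metadata
validate_total_frames_num = 30  # frames that could actually be grabbed
num_frames = 32                 # frames requested by the caller

print(is_full_read(num_frames, total_frames_num))           # False: tries to sample 32
print(is_full_read(num_frames, validate_total_frames_num))  # True: only 30 are readable
```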

Isotr0py

Fix the internal server error when the video input has broken frames.

BTW, can the model still generate reasonable outputs with a broken video? If the video has broken frames, I suspect the server failed to fetch the video from the URL properly?

Isotr0py · 2025-09-29T12:36:18Z

vllm/multimodal/video.py

+        validate_frames_list = []
+        for idx in range(total_frames_num):
+            ok = cap.grab()
+            if ok:
+                validate_frames_list.append(idx)
+


Suggested change

-        validate_frames_list = []
-        for idx in range(total_frames_num):
-            ok = cap.grab()
-            if ok:
-                validate_frames_list.append(idx)
+        validate_frames_list = [
+            idx for idx in range(total_frames_num)
+            if cap.grab()
+        ]

Isotr0py · 2025-09-29T12:37:43Z

vllm/multimodal/video.py

+        cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
+        validate_list_idx = frame_idx[i]
+        target_frame_pos = validate_frames_list[validate_list_idx]


I think OpenCVDynamicVideoBackend will also encounter this issue?

Jixin10 · 2025-09-29T12:51:21Z

Fix the internal server error when the video input has broken frames.

BTW, can the model still generate reasonable outputs with a broken video? If the video has broken frames, I suspect the server failed to fetch the video from the URL properly?

No, a video with broken frames hits a 500 internal server error because the assertion i == num_frames fails. The engine downloads the video successfully. For example, the metadata reports 543 frames (0-542), but frame 542 cannot be read via cap.grab() (the return value ok is False). You expect 32 frames (0, 17, 34, ..., 524, 542) but get only 31 frames (0, 17, 34, ..., 524).
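
The numbers in this example can be checked with a few lines. The sketch assumes the indices come from evenly spaced sampling with truncation, as np.linspace(0, 542, 32, dtype=int) would give; the list comprehension is a pure-Python emulation, not the vLLM helper.

```python
total_frames_num = 543  # frames 0..542 according to the metadata
num_frames = 32
broken = {542}          # grab() returns False for this frame

# Evenly spaced indices, truncated to integers.
frame_idx = [
    int(k * (total_frames_num - 1) / (num_frames - 1)) for k in range(num_frames)
]
readable = [idx for idx in frame_idx if idx not in broken]

print(frame_idx[:3], frame_idx[-2:])  # [0, 17, 34] [524, 542]
print(len(frame_idx), len(readable))  # 32 31
```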

Jixin10 · 2025-09-29T12:53:41Z

Thanks, can you add a test to prevent regressions?

Let me see how to generate a video with broken frames to reproduce the bug.

Isotr0py · 2025-09-29T13:05:04Z

vllm/multimodal/video.py

+        validate_frames_list = []
+        for idx in range(total_frames_num):
+            ok = cap.grab()
+            if ok:
+                validate_frames_list.append(idx)


BTW, iterating over the whole video can be quite expensive for long videos, even if we don't load any frames into memory. I think we should avoid iterating over the whole video twice as much as possible.

BTW, iterating over the whole video can be quite expensive for long videos, even if we don't load any frames into memory. I think we should avoid iterating over the whole video twice as much as possible.

Yes. I measured the cost on my development machine: cap.grab() takes about 5e-4 s per call and cap.retrieve() about 1e-3 s. For a video with 543 frames, reading took 0.3 s before the fix and 0.6 s after it. I also tried cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number), but it took even longer.
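
These measurements are consistent with a simple per-call cost model. The sketch below plugs in the figures quoted above (5e-4 s per grab, 1e-3 s per retrieve, 543 frames) and assumes 32 sampled frames, as in the earlier example; the function name is hypothetical.

```python
GRAB_COST_S = 5e-4      # measured cost of one cap.grab()
RETRIEVE_COST_S = 1e-3  # measured cost of one cap.retrieve()


def estimated_seconds(total_frames, sampled_frames, grab_passes):
    """Rough decode time: each pass grabs every frame, sampled frames are retrieved once."""
    return grab_passes * total_frames * GRAB_COST_S + sampled_frames * RETRIEVE_COST_S


before = estimated_seconds(543, 32, grab_passes=1)  # single read pass
after = estimated_seconds(543, 32, grab_passes=2)   # validation pass + read pass
print(round(before, 1), round(after, 1))  # 0.3 0.6
```

The grab calls dominate, so a second full pass roughly doubles the total, which is why avoiding it matters for long videos.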

BTW, iterating over the whole video can be quite expensive for long videos, even if we don't load any frames into memory. I think we should avoid iterating over the whole video twice as much as possible.

Another method I considered is repeating the last frame to fill the frame list, which costs no extra time.

I think you can just return the slice of valid frames:

valid_num_frames = num_frames
for idx in range(total_frames_num):
    ok = cap.grab()
    if not ok:
        valid_num_frames -= 1
        continue
    if idx in frame_idx:
        ret, frame = cap.retrieve()
        if ret:
            frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            i += 1
...
return frames[:valid_num_frames]
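
A runnable version of this idea, under stated assumptions: StubCapture is a hypothetical stand-in for cv2.VideoCapture, frames are plain strings rather than arrays, and the valid count is taken from the number of frames actually stored instead of being decremented per failed grab. This is a sketch of the approach, not the code that was merged.

```python
class StubCapture:
    """Hypothetical stand-in for cv2.VideoCapture with some unreadable positions."""

    def __init__(self, broken):
        self.broken = set(broken)
        self.pos = -1

    def grab(self):
        self.pos += 1
        return self.pos not in self.broken

    def retrieve(self):
        return True, f"frame-{self.pos}"


def read_valid_frames(cap, total_frames_num, frame_idx):
    """Single pass over the video; skip unreadable frames and return what was loaded."""
    wanted = set(frame_idx)
    frames = [None] * len(frame_idx)  # preallocated, like the real frame buffer
    i = 0
    for idx in range(total_frames_num):
        if not cap.grab():
            continue  # broken frame: keep going instead of aborting
        if idx in wanted:
            ok, frame = cap.retrieve()
            if ok:
                frames[i] = frame
                i += 1
    return frames[:i]  # slice off the slots that were never filled


frames = read_valid_frames(StubCapture(broken={4, 9}), 10, [0, 3, 6, 9])
print(frames)  # ['frame-0', 'frame-3', 'frame-6']
```

Whether grab() keeps succeeding after a failure on a real file depends on the decoder; the stub simply assumes it does.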

fix broken wrong frame in video input

14e45cf

Signed-off-by: 凌葭 <[email protected]>

Jixin10 requested review from DarkLight1337, ywang96 and NickLucche as code owners September 29, 2025 12:26

mergify bot added the multi-modality Related to multi-modality (#4194) label Sep 29, 2025

DarkLight1337 requested a review from Isotr0py September 29, 2025 12:27

gemini-code-assist bot reviewed Sep 29, 2025

Isotr0py reviewed Sep 29, 2025

LzVv123456 approved these changes Oct 1, 2025
