@Jixin10 Jixin10 commented Sep 29, 2025

Purpose

Fix #20313

Fix the internal server error when the video input has broken frames.

The engine gets the frame count via total_frames_num = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)). However, some frames are broken and cannot be read via cap.grab() and cap.retrieve(). As a result, i and num_frames are not equal, which fails the assertion and causes the internal server error.

Test Plan

curl -X POST localhost:50799/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model":"qwen2.5_vl",
  "messages": [
      {
        "role": "user",
        "content": [
          {
            "video_url": {
              "url": "{video_url}"
            },
            "type": "video_url"
          },
          {
            "type": "text",
            "text": "Please tell me what the video describes"
          }
        ]
      }
  ],
  "stream": false,
  "max_tokens": 1000
}'

Test Result

Before fix:

AssertionError: Expected reading 32 frames, but only loaded 31 frames from video.
INFO:     127.0.0.1:26426 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

After fix:

{"id":"chatcmpl-9246e6c97b934cbb829897080ccc3822","object":"chat.completion","created":1759148580,"model":"qwen2.5_vl","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"The video shows xxx","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":6281,"total_tokens":6469,"completion_tokens":188,"prompt_tokens_details":null},"prompt_logprobs":null}

@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Sep 29, 2025
@DarkLight1337
Member

Thanks, can you add a test to prevent regressions?

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix an internal server error caused by broken frames in video inputs, which leads to a mismatch between the expected and actual number of frames. While the approach of pre-validating frames by iterating with cap.grab() is a good step towards robustness, the current implementation introduces several critical issues in the frame reading logic. The new loop for reading frames is not robust against grab or retrieve failures, does not handle cases with duplicated frame indices correctly, and can raise an IndexError for videos with no valid frames. Additionally, the logic for determining a full read is flawed. These issues could lead to crashes or incorrect behavior. I've provided detailed comments and suggestions for fixing these problems. A similar issue likely exists in OpenCVDynamicVideoBackend which should also be addressed.

Comment on lines 148 to +163
  i = 0
- for idx in range(total_frames_num):
-     ok = cap.grab()
-     if not ok:
-         break
-     if idx in frame_idx:
-         ret, frame = cap.retrieve()
-         if ret:
-             frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
-             i += 1
+ cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
+ validate_list_idx = frame_idx[i]
+ target_frame_pos = validate_frames_list[validate_list_idx]
+ for pos in range(total_frames_num):
+     cap.grab()
+     if target_frame_pos != pos:
+         continue
+     ret, frame = cap.retrieve()
+     if ret:
+         frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+         i += 1
+         if i >= len(frame_idx):
+             break
+         validate_list_idx = frame_idx[i]
+         target_frame_pos = validate_frames_list[validate_list_idx]

critical

The new frame reading loop has several critical issues that make it fragile:

  1. IndexError: If frame_idx is empty (e.g., a video with no valid frames), frame_idx[i] at line 150 will raise an IndexError.
  2. Unhandled grab() failure: The cap.grab() call on line 153 is not checked. If it fails, the loop continues, which can lead to incorrect behavior and an eventual assertion failure.
  3. Incorrect retrieve() failure handling: If cap.retrieve() fails (ret is false), i is not incremented and target_frame_pos is not updated. The loop continues, but pos will never match target_frame_pos again, causing the final assertion to fail.
  4. Duplicate frames bug: If frame_idx contains duplicate indices (which can happen with np.linspace when upsampling), this logic will fail to read the duplicated frames. After reading a frame at a certain position, pos increments, and the same target_frame_pos for the duplicate entry will not be matched again.

A more robust implementation is needed to handle these cases correctly. Here is a suggested replacement that addresses these issues by using a while loop and properly managing state.

Additionally, OpenCVDynamicVideoBackend appears to use a similar, older frame reading loop which is also susceptible to some of these issues and should likely be updated as well.

        i = 0
        pos = 0
        cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
        while i < len(frame_idx) and pos < total_frames_num:
            # The frame position in the original video that we want to read.
            target_pos = validate_frames_list[frame_idx[i]]

            if pos < target_pos:
                # Seek forward by grabbing frames.
                if not cap.grab():
                    break  # End of stream.
                pos += 1
                continue

            # At this point, pos should be equal to target_pos.
            # Grab the frame at the current position `pos`.
            if not cap.grab():
                break  # End of stream.

            ret, frame = cap.retrieve()
            if ret:
                rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                # This frame might be requested multiple times (upsampling).
                # Fill all duplicates for this position.
                while i < len(frame_idx) and validate_frames_list[frame_idx[i]] == target_pos:
                    frames[i] = rgb_frame
                    i += 1
            else:
                # If retrieve fails, we cannot read this frame.
                # Skip all requests for this frame to avoid getting stuck.
                while i < len(frame_idx) and validate_frames_list[frame_idx[i]] == target_pos:
                    i += 1
            
            pos += 1
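
The duplicate-index concern in point 4 can be illustrated without OpenCV. The sketch below is only an illustration: `sample_indices` is a hypothetical helper that approximates evenly spaced sampling with floor division (it is not the vLLM implementation), and the two scan functions merely count how many output slots a forward-only pass would fill.

```python
def sample_indices(total, num):
    """Evenly spaced integer indices in [0, total - 1] (floor division)."""
    if num == 1:
        return [0]
    return [(k * (total - 1)) // (num - 1) for k in range(num)]


def scan_one_slot_per_position(frame_idx, total):
    """Forward-only scan that fills at most one slot per video position."""
    filled, i = 0, 0
    for pos in range(total):
        if i < len(frame_idx) and frame_idx[i] == pos:
            filled += 1
            i += 1
    return filled


def scan_fill_duplicates(frame_idx, total):
    """Forward-only scan that fills every slot requesting the current position."""
    filled, i = 0, 0
    for pos in range(total):
        while i < len(frame_idx) and frame_idx[i] == pos:
            filled += 1
            i += 1
    return filled


# Requesting more samples than frames yields repeated indices.
idx = sample_indices(total=4, num=6)
print(idx)
print(scan_one_slot_per_position(idx, 4), scan_fill_duplicates(idx, 4))
```

With repeated indices, the first scan stalls on the second request for position 0, while the second scan fills all six slots, which is the behavior the suggested `while` loop aims for.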

validate_total_frames_num = len(validate_frames_list)

# resample video to target num_frames
full_read = num_frames == -1 or total_frames_num < num_frames

high

The condition for full_read uses total_frames_num, which is the frame count from the video metadata and can be incorrect. This can lead to incorrect sampling behavior when the number of valid frames is less than the requested num_frames but total_frames_num is greater. You should use validate_total_frames_num here to ensure the decision is based on the actual number of readable frames.

Suggested change
- full_read = num_frames == -1 or total_frames_num < num_frames
+ full_read = num_frames == -1 or validate_total_frames_num < num_frames
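
A small sketch of why the choice of count matters. The numbers are made up for illustration, and `is_full_read` is a hypothetical wrapper around the expression from the diff.

```python
def is_full_read(num_frames, frame_count):
    """Same expression as in the diff, with the count passed in explicitly."""
    return num_frames == -1 or frame_count < num_frames


# Assumed example: metadata reports 40 frames, but only 30 are readable.
total_frames_num = 40
validate_total_frames_num = 30
num_frames = 32

by_metadata = is_full_read(num_frames, total_frames_num)
by_readable = is_full_read(num_frames, validate_total_frames_num)
print(by_metadata, by_readable)
```

Using the metadata count, the code would try to sample 32 frames from a video that only has 30 readable ones. Using the readable count, it falls back to reading everything.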

Member

@Isotr0py Isotr0py left a comment

> Fix the internal server error when the video input has broken frames.

BTW, can the model still generate reasonable outputs with a broken video? If the video has broken frames, I think the server failed to fetch the video from the URL properly?

Comment on lines +124 to +129
validate_frames_list = []
for idx in range(total_frames_num):
    ok = cap.grab()
    if ok:
        validate_frames_list.append(idx)

Suggested change
- validate_frames_list = []
- for idx in range(total_frames_num):
-     ok = cap.grab()
-     if ok:
-         validate_frames_list.append(idx)
+ validate_frames_list = [
+     idx for idx in range(total_frames_num)
+     if cap.grab()
+ ]

Comment on lines +149 to +151
cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
validate_list_idx = frame_idx[i]
target_frame_pos = validate_frames_list[validate_list_idx]

I think OpenCVDynamicVideoBackend will also encounter this issue?

@Jixin10
Author

Jixin10 commented Sep 29, 2025

> Fix the internal server error when the video input has broken frames.
>
> BTW, can the model still generate reasonable outputs with a broken video? If the video has broken frames, I think the server failed to fetch the video from the URL properly?

No, a video with broken frames triggers a 500 Internal Server Error because the assertion i == num_frames fails. The engine downloads the video successfully. For example, the metadata reports 543 frames (0-542), but frame 542 cannot be read via cap.grab() (the return value ok is False). You expect 32 frames (0, 17, 34 ... 524, 542) but get only 31 frames (0, 17, 34 ... 524).
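
The scenario above can be simulated without a real video. In this sketch, `FakeCapture` is a hypothetical stand-in for `cv2.VideoCapture`, and `sample_indices` approximates evenly spaced sampling with floor division. Neither is taken from vLLM; `load_frames` only mimics the shape of the pre-fix loop.

```python
class FakeCapture:
    """Hypothetical stand-in for cv2.VideoCapture with unreadable positions."""

    def __init__(self, reported_frames, broken):
        self.reported_frames = reported_frames
        self.broken = set(broken)
        self.pos = -1

    def grab(self):
        # Advance one position; fail on broken positions or past the end.
        self.pos += 1
        return self.pos < self.reported_frames and self.pos not in self.broken

    def retrieve(self):
        return True, f"frame-{self.pos}"


def sample_indices(total, num):
    """Evenly spaced integer indices in [0, total - 1] (floor division)."""
    if num == 1:
        return [0]
    return [(k * (total - 1)) // (num - 1) for k in range(num)]


def load_frames(cap, total, num):
    """Mimics the pre-fix loop: stop at the first failed grab()."""
    wanted = set(sample_indices(total, num))
    frames = []
    for idx in range(total):
        if not cap.grab():
            break
        if idx in wanted:
            ok, frame = cap.retrieve()
            if ok:
                frames.append(frame)
    return frames


# Metadata reports 543 frames, but frame 542 cannot be grabbed.
frames = load_frames(FakeCapture(543, broken=[542]), total=543, num=32)
print(len(frames))  # one fewer than the 32 requested
```

Because the last sampled index is 542, the unreadable final frame leaves the loader one frame short, which is what the assertion in the error log reports.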

@Jixin10
Author

Jixin10 commented Sep 29, 2025

> Thanks, can you add a test to prevent regressions?

Let me see how to generate a video with broken frames to reproduce the bug.
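
One possible alternative to producing a corrupted file is to stub the capture object in the test. The sketch below uses `unittest.mock.MagicMock` from the standard library; `count_readable_frames` and `make_broken_capture` are hypothetical helpers, and whether this style fits the vLLM test suite is an assumption.

```python
from unittest.mock import MagicMock


def count_readable_frames(cap, reported_frames):
    """Hypothetical helper: count how many grab() calls succeed."""
    return sum(1 for _ in range(reported_frames) if cap.grab())


def make_broken_capture(reported_frames, broken):
    """Stub capture whose grab() fails at the given frame positions."""
    cap = MagicMock()
    cap.grab.side_effect = [pos not in broken for pos in range(reported_frames)]
    return cap


def test_broken_last_frame_is_skipped():
    cap = make_broken_capture(reported_frames=10, broken={9})
    assert count_readable_frames(cap, 10) == 9
    assert cap.grab.call_count == 10


test_broken_last_frame_is_skipped()
```

A stub like this exercises only the counting logic. A test against a real file with damaged frames would still be needed to cover actual OpenCV decoding behavior.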

Comment on lines +124 to +128
validate_frames_list = []
for idx in range(total_frames_num):
    ok = cap.grab()
    if ok:
        validate_frames_list.append(idx)
Member

@Isotr0py Isotr0py Sep 29, 2025

BTW, iterating over the whole video can be quite expensive for long videos, even if we don't load any frames into memory. I think we should avoid iterating over the whole video twice as much as possible.

Author

@Jixin10 Jixin10 Sep 29, 2025

> BTW, iterating over the whole video can be quite expensive for long videos, even if we don't load any frames into memory. I think we should avoid iterating over the whole video twice as much as possible.

Yes. I measured the cost on my development machine: cap.grab() takes about 5e-4 s per call and cap.retrieve() about 1e-3 s. For a video with 543 frames, loading took 0.3 s before the fix and 0.6 s after. I also tried cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number), but it cost even more time...
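
These measurements are roughly consistent with a simple cost model. The sketch below assumes 32 sampled frames (the count from the test result) and that the fix adds one extra full pass of `cap.grab()` calls. Both are assumptions for the sake of the arithmetic.

```python
GRAB_COST_S = 5e-4       # per cap.grab() call, as measured above
RETRIEVE_COST_S = 1e-3   # per cap.retrieve() call, as measured above
TOTAL_FRAMES = 543
SAMPLED_FRAMES = 32      # assumption: same sample count as in the test result

# One pass: grab every frame, retrieve only the sampled ones.
one_pass = TOTAL_FRAMES * GRAB_COST_S + SAMPLED_FRAMES * RETRIEVE_COST_S
# The validation pass adds another grab() per frame.
two_pass = one_pass + TOTAL_FRAMES * GRAB_COST_S

print(f"one pass: {one_pass:.4f} s, two passes: {two_pass:.4f} s")
```

The estimate lands near 0.3 s and 0.6 s, so the extra pass roughly doubles the load time, and the overhead grows linearly with video length.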

Author

> BTW, iterating over the whole video can be quite expensive for long videos, even if we don't load any frames into memory. I think we should avoid iterating over the whole video twice as much as possible.

Another method I considered is repeating the last frame to fill the frame list, which costs no extra time.
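
A minimal sketch of that padding idea, using plain lists in place of the frame array. `pad_with_last_frame` is a hypothetical helper, not code from this PR.

```python
def pad_with_last_frame(frames, num_frames):
    """Repeat the last loaded frame until the list has num_frames entries."""
    if not frames:
        raise ValueError("no readable frames to pad from")
    missing = max(num_frames - len(frames), 0)
    return frames + [frames[-1]] * missing


loaded = ["f0", "f17", "f34"]   # stand-ins for decoded frames
padded = pad_with_last_frame(loaded, 5)
print(padded)
```

The trade-off is that the model sees duplicated frames, whereas returning only the valid frames changes the output length instead.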

Member

@Isotr0py Isotr0py Sep 29, 2025

I think you can just return the slice of valid frames:

        valid_num_frames = num_frames
        for idx in range(total_frames_num):
            ok = cap.grab()
            if not ok:
                valid_num_frames -= 1
                continue
            if idx in frame_idx:
                ret, frame = cap.retrieve()
                if ret:
                    frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                    i += 1
        ...
        return frames[:valid_num_frames]
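
A runnable sketch of this single-pass idea, under stated assumptions: `FakeCapture` is a hypothetical stand-in for `cv2.VideoCapture`, frames are plain strings, and the slice uses the number of slots actually filled (`i`) rather than a decremented counter. That last choice is a variation on the snippet above, not the reviewer's exact proposal.

```python
class FakeCapture:
    """Hypothetical stand-in for cv2.VideoCapture with unreadable positions."""

    def __init__(self, reported_frames, broken):
        self.reported_frames = reported_frames
        self.broken = set(broken)
        self.pos = -1

    def grab(self):
        self.pos += 1
        return self.pos < self.reported_frames and self.pos not in self.broken

    def retrieve(self):
        return True, f"frame-{self.pos}"


def read_valid_frames(cap, total_frames_num, frame_idx):
    """Single pass over the video; return only the frames that were read."""
    wanted = set(frame_idx)
    frames = [None] * len(frame_idx)
    i = 0
    for idx in range(total_frames_num):
        if not cap.grab():
            continue  # skip unreadable positions instead of aborting
        if idx in wanted:
            ret, frame = cap.retrieve()
            if ret:
                frames[i] = frame
                i += 1
    return frames[:i]


result = read_valid_frames(FakeCapture(10, broken={4, 9}), 10, [0, 3, 6, 9])
print(result)
```

This keeps the cost at one pass over the video, at the price of returning fewer frames than requested when sampled positions are unreadable.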

Successfully merging this pull request may close these issues.

[Bug]:qwen2_5vl: Internal Server Error when processing short video and vllm has been installed 0.9.0