-
-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[Bugfix][Multi Modal] Fix broken frames in video input #25881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: 凌葭 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
Thanks, can you add a test to prevent regressions? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix an internal server error caused by broken frames in video inputs, which leads to a mismatch between the expected and actual number of frames. While the approach of pre-validating frames by iterating with cap.grab()
is a good step towards robustness, the current implementation introduces several critical issues in the frame reading logic. The new loop for reading frames is not robust against grab
or retrieve
failures, does not handle cases with duplicated frame indices correctly, and can raise an IndexError
for videos with no valid frames. Additionally, the logic for determining a full read is flawed. These issues could lead to crashes or incorrect behavior. I've provided detailed comments and suggestions for fixing these problems. A similar issue likely exists in OpenCVDynamicVideoBackend
which should also be addressed.
i = 0 | ||
for idx in range(total_frames_num): | ||
ok = cap.grab() | ||
if not ok: | ||
break | ||
if idx in frame_idx: | ||
ret, frame = cap.retrieve() | ||
if ret: | ||
frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) | ||
i += 1 | ||
cap.set(cv2.CAP_PROP_POS_FRAMES, 0) | ||
validate_list_idx = frame_idx[i] | ||
target_frame_pos = validate_frames_list[validate_list_idx] | ||
for pos in range(total_frames_num): | ||
cap.grab() | ||
if target_frame_pos != pos: | ||
continue | ||
ret, frame = cap.retrieve() | ||
if ret: | ||
frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) | ||
i += 1 | ||
if i >= len(frame_idx): | ||
break | ||
validate_list_idx = frame_idx[i] | ||
target_frame_pos = validate_frames_list[validate_list_idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new frame reading loop has several critical issues that make it fragile:
IndexError
: Ifframe_idx
is empty (e.g., a video with no valid frames),frame_idx[i]
at line 150 will raise anIndexError
.- Unhandled
grab()
failure: Thecap.grab()
call on line 153 is not checked. If it fails, the loop continues, which can lead to incorrect behavior and an eventual assertion failure. - Incorrect
retrieve()
failure handling: Ifcap.retrieve()
fails (ret
is false),i
is not incremented andtarget_frame_pos
is not updated. The loop continues, butpos
will never matchtarget_frame_pos
again, causing the final assertion to fail. - Duplicate frames bug: If
frame_idx
contains duplicate indices (which can happen withnp.linspace
when upsampling), this logic will fail to read the duplicated frames. After reading a frame at a certain position,pos
increments, and the sametarget_frame_pos
for the duplicate entry will not be matched again.
A more robust implementation is needed to handle these cases correctly. Here is a suggested replacement that addresses these issues by using a while
loop and properly managing state.
Additionally, OpenCVDynamicVideoBackend
appears to use a similar, older frame reading loop which is also susceptible to some of these issues and should likely be updated as well.
i = 0
pos = 0
cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
while i < len(frame_idx) and pos < total_frames_num:
# The frame position in the original video that we want to read.
target_pos = validate_frames_list[frame_idx[i]]
if pos < target_pos:
# Seek forward by grabbing frames.
if not cap.grab():
break # End of stream.
pos += 1
continue
# At this point, pos should be equal to target_pos.
# Grab the frame at the current position `pos`.
if not cap.grab():
break # End of stream.
ret, frame = cap.retrieve()
if ret:
rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
# This frame might be requested multiple times (upsampling).
# Fill all duplicates for this position.
while i < len(frame_idx) and validate_frames_list[frame_idx[i]] == target_pos:
frames[i] = rgb_frame
i += 1
else:
# If retrieve fails, we cannot read this frame.
# Skip all requests for this frame to avoid getting stuck.
while i < len(frame_idx) and validate_frames_list[frame_idx[i]] == target_pos:
i += 1
pos += 1
validate_total_frames_num = len(validate_frames_list) | ||
|
||
# resample video to target num_frames | ||
full_read = num_frames == -1 or total_frames_num < num_frames |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition for full_read
uses total_frames_num
, which is the frame count from the video metadata and can be incorrect. This can lead to incorrect sampling behavior when the number of valid frames is less than the requested num_frames
but total_frames_num
is greater. You should use validate_total_frames_num
here to ensure the decision is based on the actual number of readable frames.
full_read = num_frames == -1 or total_frames_num < num_frames | |
full_read = num_frames == -1 or validate_total_frames_num < num_frames |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix the internal server error when the video input has broken frames.
BTW, can model still generate reasonable outputs with broken video? If the video has broken frames, I think the sever failed to fetch the video from url properly?
validate_frames_list = [] | ||
for idx in range(total_frames_num): | ||
ok = cap.grab() | ||
if ok: | ||
validate_frames_list.append(idx) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validate_frames_list = [] | |
for idx in range(total_frames_num): | |
ok = cap.grab() | |
if ok: | |
validate_frames_list.append(idx) | |
validate_frames_list = [ | |
idx for idx in range(total_frames_num) | |
if cap.grab() | |
] |
cap.set(cv2.CAP_PROP_POS_FRAMES, 0) | ||
validate_list_idx = frame_idx[i] | ||
target_frame_pos = validate_frames_list[validate_list_idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think OpenCVDynamicVideoBackend
will also encounter this issue?
No, the video with broken frames will encounter 505 internal server error because the |
Let me see how to generate a video with broken frames to reproduce the bug. |
validate_frames_list = [] | ||
for idx in range(total_frames_num): | ||
ok = cap.grab() | ||
if ok: | ||
validate_frames_list.append(idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, iterate over the whole video can be quite expensive for long video, even if we don't load any frame to memory. I think we should avoid iterating over the whole video twice as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, iterate over the whole video can be quite expensive for long video, even if we don't load any frame to memory. I think we should avoid iterating over the whole video twice as much as possible.
Yes, and I test the time cost in my development machine, cap.grab()
operation cost 5e-4
s , cap.retrieve()
cost 1e-3
s. I test a video with 543 frames and the result is it cost 0.3s before fix and 0.6s after fix. I tried cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number)
but it cost even more time...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, iterate over the whole video can be quite expensive for long video, even if we don't load any frame to memory. I think we should avoid iterating over the whole video twice as much as possible.
Another method I considered is repeat the last frame to fill the frame list, it cost no time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just return the slice of valid frames:
valid_num_frames = num_frames
for idx in range(total_frames_num):
ok = cap.grab()
if not ok:
valid_frames -= 1
continue
if idx in frame_idx:
ret, frame = cap.retrieve()
if ret:
frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
i += 1
...
return frames[:valid_num_frames]
Purpose
Fix #20313
Fix the internal server error when the video input has broken frames.
The engine get frames num via
total_frames_num = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
, however some of the frames are broken and cannot be read viacap.grab()
andcap.retrieve()
. Thereforei
andnum_frames
are not equal, causing internal server error.Test Plan
Test Result
Before fix:
After fix:
Essential Elements of an Effective PR Description Checklist
supported_models.md
andexamples
for a new model.