Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 25 additions & 11 deletions vllm/multimodal/video.py
Original file line number Diff line number Diff line change
Expand Up @@ -121,14 +121,22 @@ def load_bytes(
original_fps = cap.get(cv2.CAP_PROP_FPS)
duration = total_frames_num / original_fps if original_fps > 0 else 0

validate_frames_list = []
for idx in range(total_frames_num):
ok = cap.grab()
if ok:
validate_frames_list.append(idx)
Comment on lines +124 to +128
Copy link
Member

@Isotr0py Isotr0py Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, iterate over the whole video can be quite expensive for long video, even if we don't load any frame to memory. I think we should avoid iterating over the whole video twice as much as possible.

Copy link
Author

@Jixin10 Jixin10 Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, iterate over the whole video can be quite expensive for long video, even if we don't load any frame to memory. I think we should avoid iterating over the whole video twice as much as possible.

Yes, and I test the time cost in my development machine, cap.grab() operation cost 5e-4s , cap.retrieve() cost 1e-3s. I test a video with 543 frames and the result is it cost 0.3s before fix and 0.6s after fix. I tried cap.set(cv2.CAP_PROP_POS_FRAMES, frame_number) but it cost even more time...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, iterate over the whole video can be quite expensive for long video, even if we don't load any frame to memory. I think we should avoid iterating over the whole video twice as much as possible.

Another method I considered is repeat the last frame to fill the frame list, it cost no time.

Copy link
Member

@Isotr0py Isotr0py Sep 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just return the slice of valid frames:

        valid_num_frames = num_frames
        for idx in range(total_frames_num):
            ok = cap.grab()
            if not ok:
                valid_frames -= 1
                continue
            if idx in frame_idx:
                ret, frame = cap.retrieve()
                if ret:
                    frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                    i += 1
            ...
            return frames[:valid_num_frames]


Comment on lines +124 to +129
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
validate_frames_list = []
for idx in range(total_frames_num):
ok = cap.grab()
if ok:
validate_frames_list.append(idx)
validate_frames_list = [
idx for idx in range(total_frames_num)
if cap.grab()
]

validate_total_frames_num = len(validate_frames_list)

# resample video to target num_frames
full_read = num_frames == -1 or total_frames_num < num_frames
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The condition for full_read uses total_frames_num, which is the frame count from the video metadata and can be incorrect. This can lead to incorrect sampling behavior when the number of valid frames is less than the requested num_frames but total_frames_num is greater. You should use validate_total_frames_num here to ensure the decision is based on the actual number of readable frames.

Suggested change
full_read = num_frames == -1 or total_frames_num < num_frames
full_read = num_frames == -1 or validate_total_frames_num < num_frames

if full_read:
num_frames = total_frames_num
num_frames = validate_total_frames_num
frame_idx = list(range(0, num_frames))
else:
uniform_sampled_frames = np.linspace(0,
total_frames_num - 1,
validate_total_frames_num - 1,
num_frames,
dtype=int)
frame_idx = uniform_sampled_frames.tolist()
Expand All @@ -138,15 +146,21 @@ def load_bytes(
frames = np.empty((len(frame_idx), height, width, 3), dtype=np.uint8)

i = 0
for idx in range(total_frames_num):
ok = cap.grab()
if not ok:
break
if idx in frame_idx:
ret, frame = cap.retrieve()
if ret:
frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
i += 1
cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
validate_list_idx = frame_idx[i]
target_frame_pos = validate_frames_list[validate_list_idx]
Comment on lines +149 to +151
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think OpenCVDynamicVideoBackend will also encounter this issue?

for pos in range(total_frames_num):
cap.grab()
if target_frame_pos != pos:
continue
ret, frame = cap.retrieve()
if ret:
frames[i] = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
i += 1
if i >= len(frame_idx):
break
validate_list_idx = frame_idx[i]
target_frame_pos = validate_frames_list[validate_list_idx]
Comment on lines 148 to +163
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The new frame reading loop has several critical issues that make it fragile:

  1. IndexError: If frame_idx is empty (e.g., a video with no valid frames), frame_idx[i] at line 150 will raise an IndexError.
  2. Unhandled grab() failure: The cap.grab() call on line 153 is not checked. If it fails, the loop continues, which can lead to incorrect behavior and an eventual assertion failure.
  3. Incorrect retrieve() failure handling: If cap.retrieve() fails (ret is false), i is not incremented and target_frame_pos is not updated. The loop continues, but pos will never match target_frame_pos again, causing the final assertion to fail.
  4. Duplicate frames bug: If frame_idx contains duplicate indices (which can happen with np.linspace when upsampling), this logic will fail to read the duplicated frames. After reading a frame at a certain position, pos increments, and the same target_frame_pos for the duplicate entry will not be matched again.

A more robust implementation is needed to handle these cases correctly. Here is a suggested replacement that addresses these issues by using a while loop and properly managing state.

Additionally, OpenCVDynamicVideoBackend appears to use a similar, older frame reading loop which is also susceptible to some of these issues and should likely be updated as well.

        i = 0
        pos = 0
        cap.set(cv2.CAP_PROP_POS_FRAMES, 0)
        while i < len(frame_idx) and pos < total_frames_num:
            # The frame position in the original video that we want to read.
            target_pos = validate_frames_list[frame_idx[i]]

            if pos < target_pos:
                # Seek forward by grabbing frames.
                if not cap.grab():
                    break  # End of stream.
                pos += 1
                continue

            # At this point, pos should be equal to target_pos.
            # Grab the frame at the current position `pos`.
            if not cap.grab():
                break  # End of stream.

            ret, frame = cap.retrieve()
            if ret:
                rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                # This frame might be requested multiple times (upsampling).
                # Fill all duplicates for this position.
                while i < len(frame_idx) and validate_frames_list[frame_idx[i]] == target_pos:
                    frames[i] = rgb_frame
                    i += 1
            else:
                # If retrieve fails, we cannot read this frame.
                # Skip all requests for this frame to avoid getting stuck.
                while i < len(frame_idx) and validate_frames_list[frame_idx[i]] == target_pos:
                    i += 1
            
            pos += 1


assert i == num_frames, (f"Expected reading {num_frames} frames, "
f"but only loaded {i} frames from video.")
Expand Down