Labels: refactoring
Background Description
Currently, clip_image_preprocess still looks quite messy.
From a graphic designer's perspective, this function is essentially "Photoshop in C++": its main purpose is to preprocess a given image before sending it to the transformer. The preprocessing involves cropping, resizing, and padding the given image.
Currently, there are several strategies for preprocessing an image:
- Resize to a fixed (square) size and add padding if the aspect ratio is not square (used by llava 1.5, gemma 3, GLM)
  Note: llava 1.5 uses a gray-ish color for padding, while the rest use black
- Allow dynamic resolution / aspect ratio, but limit the max size (used by qwen2vl, pixtral)
  The image will still need to be resized to the nearest multiple of the patch size
- Crop the image into slices, aka llava-uhd (used by llava 1.6, minicpm-v)
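To make the first two strategies concrete, here is a rough sketch of just the size arithmetic, with no pixel work. All names (img_size, pad_plan, plan_resize_and_pad, plan_dynamic_resolution) and the max_side parameter are made up for illustration and are not taken from clip.cpp; real models may limit size differently (for example by total pixel count), and centering the image on the canvas is an assumption.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper types for illustration only (not from clip.cpp).
struct img_size {
    int width;
    int height;
};

struct pad_plan {
    img_size scaled;  // size after aspect-preserving resize
    int offset_x;     // left offset of the scaled image on the square canvas
    int offset_y;     // top offset of the scaled image on the square canvas
};

// Strategy 1: fit the image inside a target x target square and center it.
// The remaining area is padding (gray-ish or black, depending on the model).
inline pad_plan plan_resize_and_pad(img_size in, int target) {
    const double scale = std::min(double(target) / in.width,
                                  double(target) / in.height);
    img_size scaled;
    scaled.width  = std::max(1, std::min(target, int(std::lround(in.width  * scale))));
    scaled.height = std::max(1, std::min(target, int(std::lround(in.height * scale))));
    return { scaled, (target - scaled.width) / 2, (target - scaled.height) / 2 };
}

// Round to the nearest multiple, but never below one full multiple.
inline int round_to_multiple(int value, int multiple) {
    return std::max(multiple, int(std::lround(double(value) / multiple)) * multiple);
}

// Strategy 2: keep the aspect ratio, shrink so the longer side is at most
// max_side (assumed limit), then snap both sides to a multiple of patch_size.
inline img_size plan_dynamic_resolution(img_size in, int patch_size, int max_side) {
    const double scale = std::min(1.0, double(max_side) / std::max(in.width, in.height));
    const int w = round_to_multiple(int(std::lround(in.width  * scale)), patch_size);
    const int h = round_to_multiple(int(std::lround(in.height * scale)), patch_size);
    // Rounding up may overshoot max_side, so cap at the largest multiple that fits.
    const int cap = std::max(patch_size, (max_side / patch_size) * patch_size);
    return { std::min(w, cap), std::min(h, cap) };
}
```

For example, under these assumptions a 640x480 image with target 336 is scaled to 336x252 and placed at offset (0, 42), while a 2000x1000 image with patch size 14 and max_side 1024 ends up as 1022x518.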
Possible Refactor Approaches
Make an enum of the preprocessing strategies, split each strategy into a dedicated function, and give the functions descriptive names.
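A minimal sketch of what that could look like, assuming one enum value per strategy listed above. Every name here (preprocess_strategy, preprocess_config, run_preprocess and the three stub functions) is hypothetical and does not exist in clip.cpp; the stubs only return their own name so the dispatch can be checked in isolation.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical enum: one value per preprocessing strategy.
enum class preprocess_strategy {
    FIXED_SQUARE_PAD,    // llava 1.5, gemma 3, GLM
    DYNAMIC_RESOLUTION,  // qwen2vl, pixtral
    SLICES,              // llava 1.6, minicpm-v (llava-uhd)
};

struct pad_color {
    unsigned char r, g, b;
};

// Hypothetical per-model configuration.
struct preprocess_config {
    preprocess_strategy strategy;
    pad_color           pad;  // only meaningful for FIXED_SQUARE_PAD
};

// Stubs standing in for the dedicated functions. A real version would take
// the input image and produce the preprocessed output image(s).
static std::string preprocess_fixed_square_pad(const preprocess_config &) {
    return "fixed_square_pad";
}
static std::string preprocess_dynamic_resolution(const preprocess_config &) {
    return "dynamic_resolution";
}
static std::string preprocess_slices(const preprocess_config &) {
    return "slices";
}

// Single entry point: the switch is the only place that knows about all
// strategies, so adding a model means picking an enum value, not adding
// another branch inside one large function.
inline std::string run_preprocess(const preprocess_config & cfg) {
    switch (cfg.strategy) {
        case preprocess_strategy::FIXED_SQUARE_PAD:   return preprocess_fixed_square_pad(cfg);
        case preprocess_strategy::DYNAMIC_RESOLUTION: return preprocess_dynamic_resolution(cfg);
        case preprocess_strategy::SLICES:             return preprocess_slices(cfg);
    }
    throw std::logic_error("unknown preprocess strategy");
}
```

Keeping the padding color in the config rather than in the function would let llava 1.5 and the black-padding models share the same code path.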