Refactor: (clip.cpp) identify and regroup pre-processing strategies #13077

@ngxson

Background Description

Currently, clip_image_preprocess still looks quite messy.

From a graphic designer's perspective, this function is essentially "photoshop in cpp": its main purpose is to preprocess a given image before sending it to the transformer. The preprocessing involves cropping, resizing, and padding the given image.

There are currently three strategies for preprocessing an image:

  • Resize to a fixed (square) size and add padding if the aspect ratio is not square (used by llava 1.5, gemma 3, GLM)
    Note: llava 1.5 uses a gray-ish color for padding, while the rest use black
  • Allow dynamic resolution / ratio, but limit the max size (used by qwen2vl, pixtral)
    The image still needs to be resized to the nearest multiple of the patch size
  • Crop the image into slices, aka llava-uhd (used by llava 1.6, minicpm-v)
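
To make the second strategy concrete, here is a minimal sketch of how a target size could be computed: scale the image down so its longer side fits a maximum, then snap each side to the nearest multiple of the patch size. The names `img_size`, `snap_to_patch`, and `fit_dynamic` are hypothetical and do not exist in clip.cpp; the exact rounding and clamping rules used by each model may differ.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical type, for illustration only (not part of clip.cpp).
struct img_size {
    int width;
    int height;
};

// Snap v to the nearest multiple of `patch`, keeping the result
// within [patch, largest multiple of patch that is <= max_side].
static int snap_to_patch(int v, int patch, int max_side) {
    int snapped = static_cast<int>(std::lround(static_cast<double>(v) / patch)) * patch;
    int upper   = std::max((max_side / patch) * patch, patch);
    return std::min(std::max(snapped, patch), upper);
}

// Assumed behavior: never upscale, keep the aspect ratio while shrinking
// the longer side to max_side, then align both sides to the patch grid.
static img_size fit_dynamic(img_size in, int max_side, int patch) {
    double longer = static_cast<double>(std::max(in.width, in.height));
    double scale  = std::min(1.0, max_side / longer);
    int w = static_cast<int>(std::lround(in.width  * scale));
    int h = static_cast<int>(std::lround(in.height * scale));
    return { snap_to_patch(w, patch, max_side), snap_to_patch(h, patch, max_side) };
}
```

For example, with `max_side = 448` and `patch = 14`, a 1000x500 input would map to 448x224, and a 100x50 input would be left at roughly its own size but aligned to the patch grid.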

Possible Refactor Approaches

Make an enum of the strategies, split each one into a dedicated function, and give them clear names.
