Description
I am debugging the SmolVLM2 model and have found a discrepancy with HF transformers in how the image is resized and split into tiles.
I am debugging on the latest master (commit: 5d195f1).
To my understanding, the resizing and tile-splitting logic of IDEFICS3/SmolVLM/SmolVLM2 has two steps:
- Step 1. Keeping the aspect ratio unchanged, resize the image so that the longer edge equals the specified parameter: longest_edge in HF transformers and preproc_image_size in LlamaCpp, e.g., SmolVLM2-500M-Video-Instruct uses 2048. See HF IDEFICS3 code, HF SmolVLM code.
- Step 2. Resize the image again so that both edges are whole multiples of the tile size. The tile size is specified by the same parameter, image_size, in HF transformers and LlamaCpp, e.g., SmolVLM2-500M-Video-Instruct uses 512. See HF IDEFICS3 code, HF SmolVLM code.
Take one example, width=1272, height=716 (my real debug case). The expected behavior (HF):
- Step 1: resize to 2048 x 1152.8, i.e., scaled by 2048/1272 on both sides.
- Step 2: resize to 2048 x 1536, because for the shorter side, 1152.8, the next multiple of 512 (rounding up) is 1536.
Then the image is split into 4x3=12 tiles.
The relevant code blocks in LlamaCpp are below, and their behavior differs. For the same example, 1272x716, refined_size comes out as 1536x1024, and hence the number of tiles is 6.
The caller code (link):
} else if (ctx->proj_type() == PROJECTOR_TYPE_IDEFICS3) {
// The refined size has two steps:
// 1. Resize w/ aspect-ratio preserving such that the longer side is
// the preprocessor longest size
// 2. Resize w/out preserving aspect ratio such that both sides are
// multiples of image_size (always rounding up)
//
// CITE: https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics3/image_processing_idefics3.py#L737
👉 const clip_image_size refined_size = image_manipulation::calc_size_preserved_ratio(
        original_size, params.image_size, params.preproc_image_size);

Interestingly, the inline comment of this code block aligns with my understanding above. However, looking closer into the implementation of calc_size_preserved_ratio, it does not match the comment:
The calc_size_preserved_ratio function impl (link here):
// calculate the size of the **resized** image, while preserving the aspect ratio
// the calculated size will be aligned to the nearest multiple of align_size
// if H or W size is larger than max_dimension, it will be resized to max_dimension
static clip_image_size calc_size_preserved_ratio(const clip_image_size & inp_size, const int align_size, const int max_dimension) {
if (inp_size.width <= 0 || inp_size.height <= 0 || align_size <= 0 || max_dimension <= 0) {
return {0, 0};
}
👉 float scale = std::min(1.0f, std::min(static_cast<float>(max_dimension) / inp_size.width,
static_cast<float>(max_dimension) / inp_size.height));
float target_width_f = static_cast<float>(inp_size.width) * scale;
float target_height_f = static_cast<float>(inp_size.height) * scale;
int aligned_width = CLIP_ALIGN((int)target_width_f, align_size);
int aligned_height = CLIP_ALIGN((int)target_height_f, align_size);
return {aligned_width, aligned_height};
}

The problem is on the line marked with the hand pointer: with my example, 1272x716, scale=1 here, but it should be scale=2048/1272=1.61. The outer std::min(1.0f, ... caps the scale at a maximum of 1, and that is the cause of the difference. If the outer std::min(1.0f, ... is removed, then in all my SmolVLM2 debug cases the LlamaCpp results exactly match HF transformers inference.
I am quite new to the LlamaCpp code, so I am not familiar with how widely PROJECTOR_TYPE_IDEFICS3 is used by other models, and I am curious about the reason for this piece of logic. Is there a special reason, e.g., do some models disallow upscaling the image, hence the cap at 1.0, or is this simply a bug that was missed?
I would appreciate it if anyone could jump in and point to some clues.