Description
I am debugging the SmolVLM2 model and have found a discrepancy with HF transformers in how the image is resized and split into tiles.
I am debugging on the latest master (commit: 5d195f1).
To my understanding, the resizing and tile-splitting logic of IDEFICS3/SmolVLM/SmolVLM2 has two steps:
- Step 1. Keeping the aspect ratio unchanged, resize the image so that the longer edge equals the specified parameter: longest_edge in HF transformers and preproc_image_size in LlamaCpp, e.g., SmolVLM2-500M-Video-Instruct uses 2048. See HF IDEFICS3 code, HF SmolVLM code.
- Step 2. Resize the image again so that both edges are whole multiples of the tile size. The tile size is specified by the same parameter, image_size, in HF transformers and LlamaCpp, e.g., SmolVLM2-500M-Video-Instruct uses 512. See HF IDEFICS3 code, HF SmolVLM code.
Take one example, width=1272, height=716 (my real debug case). The expected behavior (HF):
- Step 1: resize to 2048 x 1152.8, i.e., scaled by 2048/1272 on both sides.
- Step 2: resize to 2048 x 1536, because for the shorter side, 1152.8, the next multiple of 512 (rounding up) is 1536.
Then the image is split into 4x3=12 tiles.
The relevant code blocks in LlamaCpp are below, and their behavior differs. For the same example, 1272x716, refined_size comes out as 1536x1024, and hence the number of tiles is 6.
The caller code (link):
} else if (ctx->proj_type() == PROJECTOR_TYPE_IDEFICS3) {
// The refined size has two steps:
// 1. Resize w/ aspect-ratio preserving such that the longer side is
// the preprocessor longest size
// 2. Resize w/out preserving aspect ratio such that both sides are
// multiples of image_size (always rounding up)
//
// CITE: https://github.com/huggingface/transformers/blob/main/src/transformers/models/idefics3/image_processing_idefics3.py#L737
👉 const clip_image_size refined_size = image_manipulation::calc_size_preserved_ratio(
        original_size, params.image_size, params.preproc_image_size);

Interestingly, the inline comment of this code block aligns with my understanding above. However, looking closer into the implementation of calc_size_preserved_ratio, it does not match the comment:
The calc_size_preserved_ratio function impl (link here):
// calculate the size of the **resized** image, while preserving the aspect ratio
// the calculated size will be aligned to the nearest multiple of align_size
// if H or W size is larger than max_dimension, it will be resized to max_dimension
static clip_image_size calc_size_preserved_ratio(const clip_image_size & inp_size, const int align_size, const int max_dimension) {
if (inp_size.width <= 0 || inp_size.height <= 0 || align_size <= 0 || max_dimension <= 0) {
return {0, 0};
}
👉 float scale = std::min(1.0f, std::min(static_cast<float>(max_dimension) / inp_size.width,
static_cast<float>(max_dimension) / inp_size.height));
float target_width_f = static_cast<float>(inp_size.width) * scale;
float target_height_f = static_cast<float>(inp_size.height) * scale;
int aligned_width = CLIP_ALIGN((int)target_width_f, align_size);
int aligned_height = CLIP_ALIGN((int)target_height_f, align_size);
return {aligned_width, aligned_height};
}

The problem is on the line marked with the hand pointer: with my example, 1272x716, scale=1 here, but it should be scale=2048/1272=1.61. The outer std::min(1.0f, ... caps the scale at a maximum of 1, and that is the cause of the difference. If the outer std::min(1.0f, ... is removed, then in all my SmolVLM2 debug cases the LlamaCpp results exactly match HF transformers inference.
I am quite new to the LlamaCpp code, so I am not familiar with how widely PROJECTOR_TYPE_IDEFICS3 is used by other models, and I am curious about the reason for this piece of logic. Is there a special reason, e.g., do some models disallow upscaling the image, hence the cap at 1.0, or is this simply a bug that was missed?
I would appreciate it if anyone could jump in and point to some clues.