Conversation

@yonigozlan
Member

What does this PR do?

Finishes #37539

It also adds support for grouping other inputs that mirror the shape of images in group_images_by_shape; this could also be useful for some future video processors @zucchini-nlp ;).
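For illustration, grouping a paired input alongside the images could look like this (a hypothetical usage sketch: the tensors and the segmentation_maps pairing are invented, and the import path assumes the helpers stay in image_processing_utils_fast):

import torch
from transformers.image_processing_utils_fast import group_images_by_shape, reorder_images

# two images of one shape, one of another; the paired input mirrors the image list
images = [torch.rand(3, 224, 224), torch.rand(3, 224, 224), torch.rand(3, 448, 448)]
segmentation_maps = [torch.zeros(1, 224, 224), torch.zeros(1, 224, 224), torch.zeros(1, 448, 448)]

grouped_images, grouped_maps, grouped_index = group_images_by_shape(
    images, segmentation_maps, disable_grouping=False
)
# process each stacked group, then restore the original order
processed = {shape: stack * 2.0 for shape, stack in grouped_images.items()}
images_out = reorder_images(processed, grouped_index)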

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@molbap molbap left a comment


Nice work! Left a smol review for the image_transforms.py fix first, will do the rest in a follow-up.

Comment on lines 895 to 912
+            return (
+                {(i, j): images[i][j].unsqueeze(0) for i in range(len(images)) for j in range(len(images[i]))},
+                *[
+                    {
+                        (i, j): paired_list[i][j].unsqueeze(0)
+                        for i in range(len(paired_list))
+                        for j in range(len(paired_list[i]))
+                    }
+                    for paired_list in paired_inputs
+                ],
+                {(i, j): ((i, j), 0) for i in range(len(images)) for j in range(len(images[i]))},
+            )
         else:
-            return {i: images[i].unsqueeze(0) for i in range(len(images))}, {i: (i, 0) for i in range(len(images))}
+            return (
+                {i: images[i].unsqueeze(0) for i in range(len(images))},
+                *[{i: paired_list[i].unsqueeze(0) for i in range(len(paired_list))} for paired_list in paired_inputs],
+                {i: (i, 0) for i in range(len(images))},
+            )
Contributor

I managed to understand this after some time with pen & paper; it would be nice to rewrite it a bit to have a clearer logic flow 😀

I would suggest writing another, private helper function, something like build_ungrouped_outputs, that unrolls this logic and returns a tuple with the images dictionary, the *unpacked paired dictionaries, and the index map.

Also, for the dictionary iterators, we can iterate on keys established once (in the non-nested case) with keys = list(range(len(images))) to avoid several range(len(...)) calls.

Basically: naming and moving these into another function that has the same return value and return type.
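For the nested case, the suggested helper could look roughly like this (a sketch reconstructed from the comprehensions above, folding in the keys-established-once idea; _disable_grouping_output_nested is the name the PR ended up using):

def _disable_grouping_output_nested(images, *paired_inputs):
    """Build the disable_grouping output tuple for a nested (list of lists) structure."""
    # establish the keys once instead of repeating range(len(...)) in every comprehension
    keys = [(i, j) for i in range(len(images)) for j in range(len(images[i]))]
    images_dict = {(i, j): images[i][j].unsqueeze(0) for (i, j) in keys}
    paired_dicts = [{(i, j): paired[i][j].unsqueeze(0) for (i, j) in keys} for paired in paired_inputs]
    index_map = {(i, j): ((i, j), 0) for (i, j) in keys}
    return images_dict, *paired_dicts, index_map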

Member Author

Indeed this needs some cleaning up! And you're right we can optimize it a bit, and that's the whole point :)

Contributor

cool, let me know when you've done another pass!

Member

@zucchini-nlp zucchini-nlp left a comment


I agree that we need to refactor the grouping fn a bit, or maybe we can reorder the code so we don't need any grouping. Up to you :)

Comment on lines +46 to +60
def _validate_size(size: SizeDict) -> None:
    if not (size.height and size.width):
        raise ValueError(f"Argument `size` must be a dictionary with keys 'height' and 'width'. Got: {size}")
    if size.height != size.width:
        raise ValueError(f"Argument `size` must have the same height and width, got {size}")


def _validate_mllama_preprocess_arguments(do_resize, size, do_pad, max_image_tiles):
    if not do_pad:
        raise ValueError("MllamaImageProcessor doesn't support `do_pad=False` mode.")
    if not do_resize:
        raise ValueError("MllamaImageProcessor doesn't support `do_resize=False` mode.")
    if max_image_tiles is None or max_image_tiles <= 0:
        raise ValueError(f"MllamaImageProcessor `max_image_tiles` must be a positive integer, got {max_image_tiles}.")
    _validate_size(size)
Member

hmm, first time seeing custom validation for kwargs in image processing. I am merging #40793 today; maybe we can think of a cleaner way to validate kwargs with the hub's validators later.

The good thing is that hub validation runs at __setattr__ if we use dataclasses, but with typed dicts it is currently much more primitive.

Contributor

agreed, it should not be custom here: it is not standard, and it's low-level so it should be abstracted away.

Member Author

Probably not needed indeed! Overall I'm not against custom validation (maybe with warnings instead of errors, though), as some image processors have different constraints than others, which makes this hard to abstract away. In this case, setting resize or pad to False will still resize and pad with no warning.
Also, if I'm understanding #40793 correctly, would this only validate the kwargs when loading with from_pretrained? Or will it also work when adding kwargs to the processor's call?

Member

The hub validation currently checks only the type hints from the TypedDict on every processing call. To run the check on every from_pretrained() we have to be sure that all hub configs are saved correctly, because it will raise errors otherwise, and I don't want to break configs that are serialized badly. My next idea is to add a global validation over all values after the per-field type hints (currently the validate_kwargs and validate_processor_arguments fns).

Also, we can add per-field custom validation if we add it in the metadata, like my_field: Annotated[int, custom_validation_fn()]. Though I just found out yesterday that TypedDict does not preserve this metadata, and I am looking for a way to recover it.

Contributor

yay for Annotated! https://docs.python.org/3/library/typing.html#typing.get_type_hints with include_extras=True might be what you're looking for?
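A minimal sketch of that suggestion (standard library only; the TypedDict and validator names here are made up):

from typing import Annotated, TypedDict, get_type_hints

def is_positive(value: int) -> bool:
    return value > 0

class ImagesKwargs(TypedDict, total=False):
    max_image_tiles: Annotated[int, is_positive]  # hypothetical per-field validator

# include_extras=True keeps the Annotated wrapper instead of stripping it
hints = get_type_hints(ImagesKwargs, include_extras=True)
print(hints["max_image_tiles"].__metadata__)  # (<function is_positive ...>,)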

Member

yeah, that didn't work either. I found that it works when the metadata is a string, but not with callables. A workaround is to wrap the callable with a hub utility; that worked in the past but was removed at some point while iterating.

Contributor

ah I understand, you were resolving it at load time with importlib or something like that?

Member

nope, I am also using get_type_hints with extras, as in get_type_hints(ImagesKwargs, include_extras=True). Some quirks of Annotated, I guess; haven't had a chance to dig into the root cause yet.

Comment on lines 448 to 451
# same aspect ratio for all images in the batch
num_tiles_height, num_tiles_width = grouped_aspect_ratios[shape][0]
stacked_images = split_to_tiles(stacked_images, num_tiles_height, num_tiles_width)
processed_images_grouped[shape] = stacked_images
Member

I think we can also do the splitting before "rescale" in the previous for-loop, since rescale simply multiplies by a value and doesn't depend on image shape. That way we don't need to group aspect ratios together with images.

Member Author

You're right, my bad! It can be much simpler, with no need to group aspect ratios together afterwards. However, I think we could keep the option to group additional *args, as it will be useful for maskformer in PR #41393.

Contributor

@molbap molbap left a comment


Left some additional comments!

def group_images_by_shape(
    images: Union[list["torch.Tensor"], "torch.Tensor"],
    *paired_inputs,
    disable_grouping: bool,
Contributor

nit, but disable_grouping is tri-state so it should at least be typed Optional.
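Something like this, for instance (a sketch of the nit, not the merged signature):

from typing import Optional, Union

def group_images_by_shape(
    images: Union[list["torch.Tensor"], "torch.Tensor"],
    *paired_inputs,
    disable_grouping: Optional[bool] = None,  # the third state being "unset"
):
    ...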



 num_channels=3,
 image_size=18,
-num_images=18,
+num_images=1,
Contributor

For my information, why was this test config dropped to 1 image?

Member Author

Missed this when I took over the PR, thanks for pointing it out!

Comment on lines +285 to +299
def convert_to_rgb(
    self,
    image: ImageInput,
) -> ImageInput:
    """
    Converts an image to RGB format. Only converts if the image is of type PIL.Image.Image, otherwise returns the image
    as is.
    Args:
        image (ImageInput):
            The image to convert.
    Returns:
        ImageInput: The converted image.
    """
    return convert_to_rgb(image)
Contributor

would be nice to directly rely on the imported func

Comment on lines +168 to +179
for image_processing_class in self.image_processor_list:
    image_processing = image_processing_class(**self.image_processor_dict)
    self.assertTrue(hasattr(image_processing, "do_convert_rgb"))
    self.assertTrue(hasattr(image_processing, "do_resize"))
    self.assertTrue(hasattr(image_processing, "size"))
    self.assertTrue(hasattr(image_processing, "do_rescale"))
    self.assertTrue(hasattr(image_processing, "rescale_factor"))
    self.assertTrue(hasattr(image_processing, "do_normalize"))
    self.assertTrue(hasattr(image_processing, "image_mean"))
    self.assertTrue(hasattr(image_processing, "image_std"))
    self.assertTrue(hasattr(image_processing, "do_pad"))
    self.assertTrue(hasattr(image_processing, "max_image_tiles"))
Contributor

Thinking for a follow-up: we can automate these.
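For instance, something like this could replace the hand-written list (a sketch, driven by the existing image_processor_dict fixture rather than hard-coded attribute names):

for image_processing_class in self.image_processor_list:
    image_processing = image_processing_class(**self.image_processor_dict)
    # every configured kwarg should surface as an attribute on the processor
    for attr in self.image_processor_dict:
        self.assertTrue(hasattr(image_processing, attr), f"missing attribute: {attr}")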

Comment on lines +112 to +124
aspect_ratio_mask = torch.zeros((batch_size, max_num_images, max_image_tiles), dtype=torch.long)

# Set the first tile to 1 for all aspect ratios
# because in original implementation aspect ratios are padded with (1, 1),
# but original code examples are not built to handle batches, so we might remove it later
aspect_ratio_mask[:, :, 0] = 1

# Set the aspect ratio mask for the rest of the tiles
for i, sample_aspect_ratios in enumerate(aspect_ratios):
    for j, (num_tiles_w, num_tiles_h) in enumerate(sample_aspect_ratios):
        aspect_ratio_mask[i, j, : num_tiles_w * num_tiles_h] = 1

return aspect_ratio_mask
Contributor

There are a couple of double for-loops in the util functions here. It's a minor optim, but we could precompute + broadcast instead; that would be more efficient IMO, especially for large batches.
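For instance, the per-tile inner work above could collapse into a single broadcasted comparison (an untested sketch; only the cheap Python loop that pads the ragged aspect-ratio lists remains):

import torch

def build_aspect_ratio_mask(aspect_ratios, max_num_images, max_image_tiles):
    batch_size = len(aspect_ratios)
    # precompute tiles-per-image as a dense (batch_size, max_num_images) tensor
    num_tiles = torch.zeros((batch_size, max_num_images), dtype=torch.long)
    for i, sample_aspect_ratios in enumerate(aspect_ratios):
        for j, (num_tiles_w, num_tiles_h) in enumerate(sample_aspect_ratios):
            num_tiles[i, j] = num_tiles_w * num_tiles_h
    # mask[i, j, k] = 1 iff k < num_tiles[i, j], via broadcasting over the tile axis
    tile_range = torch.arange(max_image_tiles)
    mask = (tile_range.view(1, 1, -1) < num_tiles.unsqueeze(-1)).long()
    mask[:, :, 0] = 1  # keep the (1, 1)-padding convention from the original
    return mask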

@yonigozlan
Member Author

Thanks @molbap and @zucchini-nlp! I addressed most of your remarks, so it should be ready for another review :)

@molbap molbap self-requested a review October 9, 2025 08:34
Contributor

@molbap molbap left a comment


Looks good, left some open questions that are not blockers

Comment on lines +800 to +821
def split_to_tiles(images: "torch.Tensor", num_tiles_height: int, num_tiles_width: int) -> "torch.Tensor":
    # Split images into the required number of tiles (num_tiles_height x num_tiles_width)
    batch_size, num_channels, height, width = images.size()
    images = images.view(
        batch_size,
        num_channels,
        num_tiles_height,
        height // num_tiles_height,
        num_tiles_width,
        width // num_tiles_width,
    )
    # Permute dimensions to reorder the axes
    image = images.permute(0, 2, 4, 1, 3, 5).contiguous()
    # Reshape into the desired output shape
    # (batch_size, num_tiles_height * num_tiles_width, num_channels, height // num_tiles_height, width // num_tiles_width)
    image = image.view(
        batch_size,
        num_tiles_width * num_tiles_height,
        num_channels,
        height // num_tiles_height,
        width // num_tiles_width,
    )
    return image
Contributor

On this, gave it some thought. We're viewing the tensors to have the strides match, permuting, then calling contiguous and then re-viewing. It looks very similar to what an Unfold would do, getting a strided view directly. Out of scope for this PR but to keep in mind wrt optimizations.

What do you think?
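Roughly something like this, using Tensor.unfold for the strided views (an untested sketch; note this is the view op on Tensor, not the nn.Unfold layer, which may behave differently wrt dtypes):

import torch

def split_to_tiles_unfold(images: torch.Tensor, num_tiles_height: int, num_tiles_width: int) -> torch.Tensor:
    batch_size, num_channels, height, width = images.shape
    tile_h, tile_w = height // num_tiles_height, width // num_tiles_width
    # two strided views, no copy yet: (B, C, num_tiles_height, num_tiles_width, tile_h, tile_w)
    tiles = images.unfold(2, tile_h, tile_h).unfold(3, tile_w, tile_w)
    # same tile ordering as the permute/view version above
    tiles = tiles.permute(0, 2, 3, 1, 4, 5).contiguous()
    return tiles.view(batch_size, num_tiles_height * num_tiles_width, num_channels, tile_h, tile_w)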

Member Author

Definitely something to explore! However in this case it looks like Unfold doesn't work with uint8...

Contributor

Ah yes it needs the division flexibility maybe, wasn't aware

Comment on lines +897 to +903
def _disable_grouping_output_flat(images, *paired_inputs):
    """Build the disable_grouping output tuple for a flat list structure."""
    idx_range = range(len(images))
    images_dict = {i: images[i].unsqueeze(0) for i in idx_range}
    paired_dicts = [{i: paired_list[i].unsqueeze(0) for i in idx_range} for paired_list in paired_inputs]
    index_map = {i: (i, 0) for i in idx_range}
    return images_dict, *paired_dicts, index_map
Contributor

clearer!


Comment on lines 881 to +950
     if disable_grouping:
         if is_nested:
-            return {(i, j): images[i][j].unsqueeze(0) for i in range(len(images)) for j in range(len(images[i]))}, {
-                (i, j): ((i, j), 0) for i in range(len(images)) for j in range(len(images[i]))
-            }
+            return _disable_grouping_output_nested(images, *paired_inputs)
         else:
-            return {i: images[i].unsqueeze(0) for i in range(len(images))}, {i: (i, 0) for i in range(len(images))}
+            return _disable_grouping_output_flat(images, *paired_inputs)
Contributor

the double if sounds more like a match use-case (pun intended 🤓), but fine as it is, it's clear to read
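For the record, something like this (a sketch; match needs Python 3.10+, so whether it's usable depends on the library's minimum supported version):

match (disable_grouping, is_nested):
    case (True, True):
        return _disable_grouping_output_nested(images, *paired_inputs)
    case (True, False):
        return _disable_grouping_output_flat(images, *paired_inputs)
    case _:
        pass  # fall through to the grouping logic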

@yonigozlan yonigozlan enabled auto-merge (squash) October 9, 2025 17:03
@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, idefics2, llama4, mllama

@yonigozlan yonigozlan merged commit eb28242 into huggingface:main Oct 13, 2025
25 checks passed
ngazagna-qc pushed a commit to ngazagna-qc/transformers that referenced this pull request Oct 23, 2025
* Merge conflict

* add fast processor

* add fast processor

* make style

* add new convert rgb

* use nested group by shape in mllama fast, add support for multiple inputs in group by shape

* refactor after review

---------

Co-authored-by: Vincent <[email protected]>