Added support for Multimodal eval #1499

Merged: 16 commits into pytorch:main on Mar 24, 2025

Conversation

@anirudhs001 (Contributor) commented Feb 23, 2025

PR for #1334

Used VLMEvalWrapper and Llama3VisionTransform from torchtune to support evaluation for multimodal models (Llama 3.2 11B only for now).

Bumped lm_eval to lm_eval==0.4.7 to use HFMultimodalLM, the class that VLMEvalWrapper inherits from.
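A quick sanity check of that dependency (the hf_vlms module path is an assumption about lm_eval's layout; adjust if it differs):

    # Check that the installed lm-eval is new enough to provide the multimodal
    # base class the new VLM eval wrapper builds on.
    from importlib.metadata import version

    from lm_eval.models.hf_vlms import HFMultimodalLM  # module path assumed

    print(version("lm-eval"))                             # expect 0.4.7 after the bump
    print([c.__name__ for c in HFMultimodalLM.__mro__[:3]])  # parent classes in lm_eval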

A sample run for mmmu_val_art:

(venv) anirudhsingh@Anirudhs-MacBook-Pro-4 torchchat % python torchchat.py eval Llama-3.2-mm --device cpu --dtype bf16 --task mmmu_val_art --modality text-image --max-seq-length 2048 
NumExpr defaulting to 12 threads.
PyTorch version 2.7.0.dev20250124 available.
Looking for libcustom_ops_aot_lib.so in /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch
Loading custom ops library: /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch/extension/llm/custom_ops/libcustom_ops_aot_lib.dylib
Unable to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/Users/anirudhsingh/MISC/playground/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Modality of model=text-image
Using device=cpu
Loading model...
Time to load model: 0.25 seconds
-----------------------------------------------------------
Building contexts for mmmu_val_art on rank 0...
100%|██████████| 30/30 [00:00<00:00, 20148.78it/s]
Running generate_until requests
Running generate_until requests with text+image input: 100%|██████████| 30/30 [7:49:19<00:00, 938.65s/it]
Time to run eval: 28171.31s.
Time in model.forward: 28154.47s, over 30 model evaluations
forward run time stats - Median: 360.38s Min: 355.40s Max: 8932.57s
For model /Users/anirudhsingh/.torchchat/model-cache/meta-llama/Llama-3.2-11B-Vision-Instruct/model.pth
mmmu_val_art:
 alias: Art
 acc,none: 0.2333
 acc_stderr,none: 0.0785

And with a limit of 1 sample:

(venv) anirudhsingh@Anirudhs-MacBook-Pro-4 torchchat % python torchchat.py eval Llama-3.2-mm --device cpu --dtype bf16 --task mmmu_val_art --limit 1 --modality text-image --max-seq-length 720
NumExpr defaulting to 12 threads.
PyTorch version 2.7.0.dev20250124 available.
Looking for libcustom_ops_aot_lib.so in /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch
Loading custom ops library: /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch/extension/llm/custom_ops/libcustom_ops_aot_lib.dylib
Unable to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/Users/anirudhsingh/MISC/playground/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Modality of model=text-image
Using device=cpu
Loading model...
Time to load model: 0.25 seconds
-----------------------------------------------------------
Building contexts for mmmu_val_art on rank 0...
100%|██████████| 1/1 [00:00<00:00, 5159.05it/s]
Running generate_until requests
Running generate_until requests with text+image input: 100%|██████████| 1/1 [08:38<00:00, 518.97s/it]
Time to run eval: 531.16s.
Time in model.forward: 518.80s, over 1 model evaluations
forward run time stats - Median: 518.80s Min: 518.80s Max: 518.80s
For model /Users/anirudhsingh/.torchchat/model-cache/meta-llama/Llama-3.2-11B-Vision-Instruct/model.pth
mmmu_val_art:
 alias: Art
 acc,none: 0.0000

pytorch-bot (bot) commented Feb 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1499

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 815966c with merge base 4d8bab5 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Feb 23, 2025
@Jack-Khuu added the enhancement and Evaluation/Benchmarking labels on Feb 23, 2025
@Jack-Khuu (Contributor) left a comment

Nice work!
I haven't sat down and given it a full test run yet, but I've left some initial thoughts.

@@ -130,5 +130,5 @@ if [[ -x "$(command -v nvidia-smi)" ]]; then
fi
(
set -x
-  $PIP_EXECUTABLE install evaluate=="0.4.3" lm-eval=="0.4.2" psutil=="6.0.0"
+  $PIP_EXECUTABLE install evaluate=="0.4.3" lm-eval=="0.4.7" psutil=="6.0.0"
Contributor:

Beyond the scope of this PR, but the requirements duplicated here and in requirements.txt will be collapsed when we introduce packaging.

type=str,
default="text",
choices=["text", "text-image"],
# help=argparse.SUPPRESS,
Contributor:

Suggested change (remove this commented-out line):

    # help=argparse.SUPPRESS,

Contributor:

Since this arg is only used for evaluation, let's bump it into _add_evaluation_args() below
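A minimal sketch of that move (the _add_evaluation_args name and the flag values come from this PR; the argument-group wiring shown here is assumed):

    # Hypothetical: register --modality with the evaluation-only arguments
    # instead of the shared builder arguments.
    import argparse

    def _add_evaluation_args(parser: argparse.ArgumentParser) -> None:
        eval_group = parser.add_argument_group("Evaluation")
        eval_group.add_argument(
            "--modality",
            type=str,
            default="text",
            choices=["text", "text-image"],
            help="Input modality to evaluate.",
        )

    parser = argparse.ArgumentParser()
    _add_evaluation_args(parser)
    print(parser.parse_args(["--modality", "text-image"]).modality)  # text-image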

@@ -71,6 +71,7 @@ class BuilderArgs:
dynamic_shapes: bool = False
max_seq_length: Optional[int] = None
attention_backend: str = "math"
modality: Optional[str] = "text"
Contributor:

modality isn't super related to BuilderArgs, so let's leave it out. I commented in the ArgParser with details.

@@ -223,6 +482,57 @@ def eval(
return eval_results


def multi_model_eval(
Contributor:

Looks like this and eval() are fairly similar. Mind combining them?
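One possible shape for the combined function, sketched with stand-in classes (none of these names are torchchat's actual code):

    # Hypothetical sketch of merging eval() and multi_model_eval() behind a
    # single modality switch; the wrapper classes here are stubs for illustration.
    class _TextEvalWrapper:
        def evaluate(self, tasks, limit=None):
            return {"wrapper": "text", "tasks": tasks, "limit": limit}

    class _VLMEvalWrapper:
        def evaluate(self, tasks, limit=None):
            return {"wrapper": "text-image", "tasks": tasks, "limit": limit}

    def eval_model(tasks, modality="text", limit=None):
        # Only the wrapper construction differs per modality; the rest of the
        # eval flow stays shared.
        wrappers = {"text": _TextEvalWrapper, "text-image": _VLMEvalWrapper}
        if modality not in wrappers:
            raise ValueError(f"Unsupported modality: {modality}")
        return wrappers[modality]().evaluate(tasks, limit=limit)

    print(eval_model(["mmmu_val_art"], modality="text-image", limit=1))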

@anirudhs001 (Contributor, Author)

@Jack-Khuu sorry for the delay.
I've made the changes from your comments. I forgot to remove the modality arg from BuilderArgs, but did that in the last commit. This should fix the AttributeError: 'Namespace' object has no attribute 'modality' failures [1,2,3...]

Not sure what's wrong with test-gpu-eval-sanity-check (cuda, stories15M) / linux-job though. I tried running the test on main on CPU, but that fails for me too.

@Jack-Khuu (Contributor)

Sorry about the delay on my side as well. I just kicked off the jobs again. Let's see what's going on.

@anirudhs001 (Contributor, Author) commented Mar 10, 2025

@Jack-Khuu lm_eval is causing the error. The tests pass in main with lm_eval==0.4.2, but fail with 0.4.5+

@anirudhs001 (Contributor, Author)

Found the problem.
While calculating the log-likelihoods, lm_eval now prepends a different prefix token.
It has changed from (link)

rolling_token_windows = list(
    map(
        utils.make_disjoint_window,
        utils.get_rolling_token_windows(
            token_list=self.tok_encode(string),
            prefix_token=self.eot_token_id,
            max_seq_len=self.max_length,
            context_len=1,
        ),
    )
)

to (link)

rolling_token_windows: List[Tuple[List[int], List[int]]] = list(
    map(
        utils.make_disjoint_window,
        utils.get_rolling_token_windows(
            token_list=self.tok_encode(string),
            prefix_token=self.prefix_token_id,
            max_seq_len=self.max_length,
            context_len=1,
        ),
    )
)

HFLM's eot_token_id is 2, whereas prefix_token_id defaults to tokenizer.bos_token_id, which is 50256.
There's some module in our test model that limits token ids to [0, 32000] and throws the index-out-of-bounds exception.
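A self-contained illustration of that failure mode (the vocab size is an assumption, standing in for the test model's embedding):

    # An embedding sized for a 32000-token vocab accepts the old prefix token
    # (eot_token_id == 2) but raises for the new default prefix_token_id (50256).
    import torch

    embedding = torch.nn.Embedding(32000, 64)  # assumed vocab size of the test model

    embedding(torch.tensor([2, 10, 42]))       # in range: works
    try:
        embedding(torch.tensor([50256, 10, 42]))
    except IndexError as e:
        print("index out of range:", e)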

@anirudhs001 (Contributor, Author)

@Jack-Khuu can you re-run the tests please?

@Jack-Khuu (Contributor) left a comment

Changes are looking great, we should be good to land soon


Were you able to validate whether normal `pytorch generate` works with the tokenizer changes, btw?

https://github.com/pytorch/torchchat/blob/main/docs/multimodal.md

Comment on lines 39 to 50
from torchtune import utils
from torchtune.data import (
    format_content_with_images,
    left_pad_sequence,
    Message,
    padded_collate_tiled_images_and_mask,
)
from torchtune.generation import generate, sample

from torchtune.modules.common_utils import local_kv_cache
from torchtune.modules.model_fusion import DeepFusionModel
from torchtune.modules.transforms import Transform
Contributor:

Let's move the imports into VLMEvalWrapper.__init__().

We're consciously recognizing that doing so is considered bad style, but it reduces the import overhead/requirements for users not using torchtune.

(You'll need to update the type hints in the model definition with strings, since the types don't get defined until __init__ is called.)
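A minimal sketch of that pattern (the class name and the Transform import come from this PR; the rest of the body is illustrative):

    class VLMEvalWrapper:
        def __init__(self, model, transform: "Transform", max_seq_length: int = 2048):
            # Deferred import: users who never construct the wrapper don't need
            # torchtune installed. The string annotation above is needed because
            # the type isn't defined until this runs.
            from torchtune.modules.transforms import Transform  # noqa: F401

            self.model = model
            self.transform = transform
            self.max_seq_length = max_seq_length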

Contributor (Author):

Were you able to validate whether normal `pytorch generate` works with the tokenizer changes, btw?
https://github.com/pytorch/torchchat/blob/main/docs/multimodal.md

Just did, and it threw an exception.
Need to make changes for Llama3VisionTransform.

Contributor:

To avoid scope creep in this PR, how about we undo the tokenizer changes you made in builder.py and push the tokenizer resolution just within eval.py?

For example, within eval we can try:

    elif modality == "text-image":
        ... llama3_2_vision_transform(path=str(self.tokenizer_path))
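A slightly fuller sketch of that idea (llama3_2_vision_transform and its path argument come from the comment above; the helper and attribute names around it are assumed):

    # Hypothetical eval.py-side tokenizer resolution keyed on modality.
    from torchtune.models.llama3_2_vision import llama3_2_vision_transform

    def resolve_tokenizer(tokenizer, tokenizer_path, modality: str):
        if modality == "text-image":
            # Multimodal eval builds the vision transform from the tokenizer path.
            return llama3_2_vision_transform(path=str(tokenizer_path))
        # Text-only eval keeps the tokenizer that builder.py already produced.
        return tokenizer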

Comment on lines 200 to 212
# Having the imports here allows running other evals without installing torchtune
from torchtune import utils
from torchtune.data import (
    format_content_with_images,
    left_pad_sequence,
    Message,
    padded_collate_tiled_images_and_mask,
)
from torchtune.generation import generate, sample

from torchtune.modules.common_utils import local_kv_cache
from torchtune.modules.model_fusion import DeepFusionModel
from torchtune.modules.transforms import Transform
Contributor:

I think we need this one layer deeper, inside of __init__.

Class definition is always executed even if a class instance isn't made.
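A quick self-contained illustration of that point (hypothetical module, not part of this PR):

    # demo.py: the class body (including any imports placed in it) runs as soon
    # as the class statement executes, even if no instance is ever created.
    class Wrapper:
        print("class body runs at definition time")
        # an `import torchtune` here would also run at definition time

        def __init__(self):
            print("__init__ runs only when an instance is created")

    print("module done; no Wrapper() was constructed, yet the class body ran")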

@anirudhs001 (Contributor, Author) commented Mar 23, 2025

I think we need this one layer deeper, inside of __init__.

Class definition is always executed even if a class instance isn't made.

Oh, I did not know this. One learns something every day :)

Let's move the imports into VLMEvalWrapper.__init__().

We're consciously recognizing that doing so is considered bad style, but it reduces the import overhead/requirements for users not using torchtune.

(You'll need to update the type hints in the model definition with strings, since the types don't get defined until __init__ is called.)

I did, and then ran a text-only eval (python torchchat.py eval stories15M --dtype fp32 --limit 5) after uninstalling torchtune.
It failed because we have torchtune imports in model.py too:

from torchtune.models.clip import clip_vision_encoder
from torchtune.models.llama3_1._component_builders import llama3_1 as llama3_1_builder
from torchtune.models.llama3_2_vision._component_builders import (
    llama3_2_vision_decoder,
    llama3_2_vision_encoder,
)
from torchtune.modules.model_fusion import DeepFusionModel

I suppose model.py would be used by most/all flows in torchchat. If this is the case, do we still need to move torchtune imports inside VLMEvalWrapper?

Contributor:

It failed because we have torchtune imports in model.py too:

This is expected atm; we're still in the process of teasing out the torchtune dependencies, but what you have here will help that effort.

@Jack-Khuu (Contributor)

Thanks for the updates!!! I'll give this another test tomorrow and we should be good to merge in.

@Jack-Khuu (Contributor) left a comment

Verified on A100

Great work @anirudhs001, merging in.

@Jack-Khuu merged commit dcc82b9 into pytorch:main on Mar 24, 2025.
72 checks passed.