Added support for Multimodal eval #1499

Merged: 16 commits into pytorch:main on Mar 24, 2025

Conversation

@anirudhs001 (Contributor) commented Feb 23, 2025

PR for #1334

Used VLMEvalWrapper and Llama3VisionTransform from torchtune to support evaluation for multimodal models (Llama 3.2 11B only for now).

Bumped lm_eval to lm_eval==0.4.7 to use HFMultimodalLM, the class that VLMEvalWrapper inherits from.
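A quick sanity check of that dependency (the hf_vlms module path is an assumption about lm_eval's layout; adjust if it differs):

    # Check that the installed lm-eval is new enough to provide the multimodal
    # base class the new VLM eval wrapper builds on.
    from importlib.metadata import version

    from lm_eval.models.hf_vlms import HFMultimodalLM  # module path assumed

    print(version("lm-eval"))                             # expect 0.4.7 after the bump
    print([c.__name__ for c in HFMultimodalLM.__mro__[:3]])  # parent classes in lm_eval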

A sample run for mmmu_val_art:

(venv) anirudhsingh@Anirudhs-MacBook-Pro-4 torchchat % python torchchat.py eval Llama-3.2-mm --device cpu --dtype bf16 --task mmmu_val_art --modality text-image --max-seq-length 2048 
NumExpr defaulting to 12 threads.
PyTorch version 2.7.0.dev20250124 available.
Looking for libcustom_ops_aot_lib.so in /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch
Loading custom ops library: /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch/extension/llm/custom_ops/libcustom_ops_aot_lib.dylib
Unable to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/Users/anirudhsingh/MISC/playground/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Modality of model=text-image
Using device=cpu
Loading model...
Time to load model: 0.25 seconds
-----------------------------------------------------------
Building contexts for mmmu_val_art on rank 0...
100%|██████████| 30/30 [00:00<00:00, 20148.78it/s]
Running generate_until requests
Running generate_until requests with text+image input: 100%|██████████| 30/30 [7:49:19<00:00, 938.65s/it]
Time to run eval: 28171.31s.
Time in model.forward: 28154.47s, over 30 model evaluations
forward run time stats - Median: 360.38s Min: 355.40s Max: 8932.57s
For model /Users/anirudhsingh/.torchchat/model-cache/meta-llama/Llama-3.2-11B-Vision-Instruct/model.pth
mmmu_val_art:
 alias: Art
 acc,none: 0.2333
 acc_stderr,none: 0.0785

And with a limit of 1 sample:

(venv) anirudhsingh@Anirudhs-MacBook-Pro-4 torchchat % python torchchat.py eval Llama-3.2-mm --device cpu --dtype bf16 --task mmmu_val_art --limit 1 --modality text-image --max-seq-length 720
NumExpr defaulting to 12 threads.
PyTorch version 2.7.0.dev20250124 available.
Looking for libcustom_ops_aot_lib.so in /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch
Loading custom ops library: /Users/anirudhsingh/MISC/playground/torchchat/venv/lib/python3.10/site-packages/executorch/extension/llm/custom_ops/libcustom_ops_aot_lib.dylib
Unable to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/Users/anirudhsingh/MISC/playground/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Modality of model=text-image
Using device=cpu
Loading model...
Time to load model: 0.25 seconds
-----------------------------------------------------------
Building contexts for mmmu_val_art on rank 0...
100%|██████████| 1/1 [00:00<00:00, 5159.05it/s]
Running generate_until requests
Running generate_until requests with text+image input: 100%|██████████| 1/1 [08:38<00:00, 518.97s/it]
Time to run eval: 531.16s.
Time in model.forward: 518.80s, over 1 model evaluations
forward run time stats - Median: 518.80s Min: 518.80s Max: 518.80s
For model /Users/anirudhsingh/.torchchat/model-cache/meta-llama/Llama-3.2-11B-Vision-Instruct/model.pth
mmmu_val_art:
 alias: Art
 acc,none: 0.0000

pytorch-bot (bot) commented Feb 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1499

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 815966c with merge base 4d8bab5 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Feb 23, 2025
@Jack-Khuu added the enhancement and Evaluation/Benchmarking labels on Feb 23, 2025
@Jack-Khuu (Contributor) left a comment

Nice work!
I haven't sat down and given it a full test run yet, but I've left some initial thoughts.

@@ -130,5 +130,5 @@ if [[ -x "$(command -v nvidia-smi)" ]]; then
fi
(
set -x
-  $PIP_EXECUTABLE install evaluate=="0.4.3" lm-eval=="0.4.2" psutil=="6.0.0"
+  $PIP_EXECUTABLE install evaluate=="0.4.3" lm-eval=="0.4.7" psutil=="6.0.0"
Contributor:

Beyond the scope of this PR, but the requirements duplicated here and in requirements.txt will be collapsed when we introduce packaging.

type=str,
default="text",
choices=["text", "text-image"],
# help=argparse.SUPPRESS,
Contributor:

Suggested change (remove this commented-out line):

    # help=argparse.SUPPRESS,

Contributor:

Since this arg is only used for evaluation, let's bump it into _add_evaluation_args() below
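A minimal sketch of that move (the _add_evaluation_args name and the flag values come from this PR; the argument-group wiring shown here is assumed):

    # Hypothetical: register --modality with the evaluation-only arguments
    # instead of the shared builder arguments.
    import argparse

    def _add_evaluation_args(parser: argparse.ArgumentParser) -> None:
        eval_group = parser.add_argument_group("Evaluation")
        eval_group.add_argument(
            "--modality",
            type=str,
            default="text",
            choices=["text", "text-image"],
            help="Input modality to evaluate.",
        )

    parser = argparse.ArgumentParser()
    _add_evaluation_args(parser)
    print(parser.parse_args(["--modality", "text-image"]).modality)  # text-image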

@@ -71,6 +71,7 @@ class BuilderArgs:
dynamic_shapes: bool = False
max_seq_length: Optional[int] = None
attention_backend: str = "math"
modality: Optional[str] = "text"
Contributor:

modality isn't super related to BuilderArgs, so let's leave it out. I commented in the ArgParser with details.

@@ -223,6 +482,57 @@ def eval(
return eval_results


def multi_model_eval(
Contributor:

Looks like this and eval() are fairly similar. Mind combining them?
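One possible shape for the combined function, sketched with stand-in classes (none of these names are torchchat's actual code):

    # Hypothetical sketch of merging eval() and multi_model_eval() behind a
    # single modality switch; the wrapper classes here are stubs for illustration.
    class _TextEvalWrapper:
        def evaluate(self, tasks, limit=None):
            return {"wrapper": "text", "tasks": tasks, "limit": limit}

    class _VLMEvalWrapper:
        def evaluate(self, tasks, limit=None):
            return {"wrapper": "text-image", "tasks": tasks, "limit": limit}

    def eval_model(tasks, modality="text", limit=None):
        # Only the wrapper construction differs per modality; the rest of the
        # eval flow stays shared.
        wrappers = {"text": _TextEvalWrapper, "text-image": _VLMEvalWrapper}
        if modality not in wrappers:
            raise ValueError(f"Unsupported modality: {modality}")
        return wrappers[modality]().evaluate(tasks, limit=limit)

    print(eval_model(["mmmu_val_art"], modality="text-image", limit=1))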

@anirudhs001 (Contributor, Author)

@Jack-Khuu sorry for the delay.
I've made the changes from your comments. I forgot to remove the modality arg from BuilderArgs, but did that in the last commit. This should fix the AttributeError: 'Namespace' object has no attribute 'modality' failures [1,2,3...]

Not sure what's wrong with test-gpu-eval-sanity-check (cuda, stories15M) / linux-job though. I tried running the test on main on CPU, but that fails for me too.

@Jack-Khuu (Contributor)

Sorry about the delay on my side as well. I just kicked off the jobs again. Let's see what's going on.

@anirudhs001 (Contributor, Author) commented Mar 10, 2025

@Jack-Khuu lm_eval is causing the error. The tests pass in main with lm_eval==0.4.2, but fail with 0.4.5+

@anirudhs001 (Contributor, Author)

Found the problem.
While calculating the log-likelihoods, lm_eval now prepends a different prefix token.
It has changed from (link)

rolling_token_windows = list(
    map(
        utils.make_disjoint_window,
        utils.get_rolling_token_windows(
            token_list=self.tok_encode(string),
            prefix_token=self.eot_token_id,
            max_seq_len=self.max_length,
            context_len=1,
        ),
    )
)

to (link)

rolling_token_windows: List[Tuple[List[int], List[int]]] = list(
    map(
        utils.make_disjoint_window,
        utils.get_rolling_token_windows(
            token_list=self.tok_encode(string),
            prefix_token=self.prefix_token_id,
            max_seq_len=self.max_length,
            context_len=1,
        ),
    )
)

HFLM's eot_token_id is 2, whereas prefix_token_id defaults to tokenizer.bos_token_id, which is 50256.
There's some module in our test model that limits token ids to [0, 32000] and throws the index-out-of-bounds exception.
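A self-contained illustration of that failure mode (the vocab size is an assumption, standing in for the test model's embedding):

    # An embedding sized for a 32000-token vocab accepts the old prefix token
    # (eot_token_id == 2) but raises for the new default prefix_token_id (50256).
    import torch

    embedding = torch.nn.Embedding(32000, 64)  # assumed vocab size of the test model

    embedding(torch.tensor([2, 10, 42]))       # in range: works
    try:
        embedding(torch.tensor([50256, 10, 42]))
    except IndexError as e:
        print("index out of range:", e)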

@anirudhs001 (Contributor, Author)

@Jack-Khuu can you re-run the tests please?

@Jack-Khuu (Contributor) left a comment

Changes are looking great, we should be good to land soon


Were you able to validate whether normal `pytorch generate` works with the tokenizer changes, btw?

https://github.com/pytorch/torchchat/blob/main/docs/multimodal.md

Comment on lines 39 to 50
from torchtune import utils
from torchtune.data import (
    format_content_with_images,
    left_pad_sequence,
    Message,
    padded_collate_tiled_images_and_mask,
)
from torchtune.generation import generate, sample

from torchtune.modules.common_utils import local_kv_cache
from torchtune.modules.model_fusion import DeepFusionModel
from torchtune.modules.transforms import Transform
Contributor:

Let's move the imports into VLMEvalWrapper.__init__().

We're consciously recognizing that doing so is considered bad style, but it reduces the import overhead/requirements for users not using torchtune.

(You'll need to update the type hints in the model definition with strings, since the types don't get defined until __init__ is called.)
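A minimal sketch of that pattern (the class name and the Transform import come from this PR; the rest of the body is illustrative):

    class VLMEvalWrapper:
        def __init__(self, model, transform: "Transform", max_seq_length: int = 2048):
            # Deferred import: users who never construct the wrapper don't need
            # torchtune installed. The string annotation above is needed because
            # the type isn't defined until this runs.
            from torchtune.modules.transforms import Transform  # noqa: F401

            self.model = model
            self.transform = transform
            self.max_seq_length = max_seq_length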

Contributor (Author):

Were you able to validate whether normal `pytorch generate` works with the tokenizer changes, btw?
https://github.com/pytorch/torchchat/blob/main/docs/multimodal.md

Just did, and it threw an exception.
Need to make changes for Llama3VisionTransform.

Contributor:

To avoid scope creep in this PR, how about we undo the tokenizer changes you made in builder.py and push the tokenizer resolution just within eval.py?

For example, within eval we can try:

    elif modality == "text-image":
        ... llama3_2_vision_transform(path=str(self.tokenizer_path))
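A slightly fuller sketch of that idea (llama3_2_vision_transform and its path argument come from the comment above; the helper and attribute names around it are assumed):

    # Hypothetical eval.py-side tokenizer resolution keyed on modality.
    from torchtune.models.llama3_2_vision import llama3_2_vision_transform

    def resolve_tokenizer(tokenizer, tokenizer_path, modality: str):
        if modality == "text-image":
            # Multimodal eval builds the vision transform from the tokenizer path.
            return llama3_2_vision_transform(path=str(tokenizer_path))
        # Text-only eval keeps the tokenizer that builder.py already produced.
        return tokenizer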

Comment on lines 200 to 212
# Having the imports here allows running other evals without installing torchtune
from torchtune import utils
from torchtune.data import (
    format_content_with_images,
    left_pad_sequence,
    Message,
    padded_collate_tiled_images_and_mask,
)
from torchtune.generation import generate, sample

from torchtune.modules.common_utils import local_kv_cache
from torchtune.modules.model_fusion import DeepFusionModel
from torchtune.modules.transforms import Transform
Contributor:

I think we need this one layer deeper, inside of __init__.

Class definition is always executed even if a class instance isn't made.
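A quick self-contained illustration of that point (hypothetical module, not part of this PR):

    # demo.py: the class body (including any imports placed in it) runs as soon
    # as the class statement executes, even if no instance is ever created.
    class Wrapper:
        print("class body runs at definition time")
        # an `import torchtune` here would also run at definition time

        def __init__(self):
            print("__init__ runs only when an instance is created")

    print("module done; no Wrapper() was constructed, yet the class body ran")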

@anirudhs001 (Contributor, Author) commented Mar 23, 2025

I think we need this one layer deeper, inside of __init__.

Class definition is always executed even if a class instance isn't made.

Oh, I did not know this. One learns something every day :)

Let's move the imports into VLMEvalWrapper.__init__().

We're consciously recognizing that doing so is considered bad style, but it reduces the import overhead/requirements for users not using torchtune.

(You'll need to update the type hints in the model definition with strings, since the types don't get defined until __init__ is called.)

I did, and then ran a text-only eval (python torchchat.py eval stories15M --dtype fp32 --limit 5) after uninstalling torchtune.
It failed because we have torchtune imports in model.py too:

from torchtune.models.clip import clip_vision_encoder
from torchtune.models.llama3_1._component_builders import llama3_1 as llama3_1_builder
from torchtune.models.llama3_2_vision._component_builders import (
    llama3_2_vision_decoder,
    llama3_2_vision_encoder,
)
from torchtune.modules.model_fusion import DeepFusionModel

I suppose model.py would be used by most/all flows in torchchat. If this is the case, do we still need to move torchtune imports inside VLMEvalWrapper?

Contributor:

It failed because we have torchtune imports in model.py too:

This is expected atm; we're still in the process of teasing out the torchtune dependencies, but what you have here will help that effort.

@Jack-Khuu (Contributor)

Thanks for the updates!!! I'll give this another test tomorrow and we should be good to merge in.

@Jack-Khuu (Contributor) left a comment

Verified on A100

Great work @anirudhs001, merging in.

@Jack-Khuu merged commit dcc82b9 into pytorch:main on Mar 24, 2025.
72 checks passed.