Multimodal Eval Enablement (Looking for Developer to Implement Design) #1334

Closed
@Olivia-liu

Description

🚀 The feature, motivation and pitch

Please note: since the actual implementation is going to be simple and the design has already been reviewed, the purpose of this GitHub issue is to find a developer to implement this feature ASAP.

LLM eval refers to the process of assessing the perplexity, performance, and capabilities of LLMs, usually by having the model complete one or more tasks and assigning them scores. Torchchat already uses EleutherAI’s lm-evaluation-harness to do eval on text LLMs (code pointer). Recently, torchtune worked with EleutherAI to enable eval on text-image models in the harness, and has integrated this feature into torchtune (code pointer). Torchchat wants to reuse that solution from torchtune for text-image models.

Without the ability to do eval on multimodal LLMs, the enablement of multimodal LLMs on torchchat is incomplete. It’s critical to understand how well torchchat performs with image inputs.

Additional context

Assumptions

  • The eval for text LLMs is already enabled on torchchat. Code pointer to the core eval function and the main function.
  • The Llama 3.2-11b multimodal model has been onboarded to torchchat, and in the future there will be more multimodal LLMs on torchchat.
  • EleutherAI’s lm-evaluation-harness has enabled eval on llama3.2-11b, so we don’t need to make code changes in the EleutherAI repo.

The Main Goal

A torchchat user can run eval on the llama 3.2-11b model (which is image-text-in, text-out). Note that we don’t need to worry about the internals of how the eval happens, because we will only be calling EleutherAI’s eval libraries and reporting the metrics they return.

The user interface will be a command line, python torchchat.py eval <model-name>, with additional arguments specifying detailed requirements for the eval tasks.

The result will be printed to the terminal and will include the following metrics:

  • Tasks that have been run
  • The score to each task
  • The time it took to run each task

RFC (Optional)

Design

Overview

In this design, the multimodal eval in torchchat will borrow from the implementation of multimodal eval in torchtune, which utilizes EleutherAI’s lm-evaluation-harness. We can do this because torchchat uses the same Llama 3.2-11b model definition as torchtune.

Details

The Core Eval Implementation

[Preferred] Approach A: import the implementation of HFMultimodalLM from torchtune directly

The easiest option is to import torchtune’s HFMultimodalLM wrapper directly, then call evaluate() with this wrapper class passed in.

Here’s torchtune’s implementation of HFMultimodalLM: code pointer.

Pseudocode:

# In eval.py
from torchtune.recipes.eleuther_eval import _VLMEvalWrapper

if model is text-based:
   do the existing text-based model eval
elif model is text-image-based:
   eval_results = evaluate(_VLMEvalWrapper(...))
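
Slightly more concretely, the call could look like the sketch below. The helper name run_multimodal_eval, the wrapper constructor arguments, and the example task name are illustrative assumptions rather than torchtune’s or the harness’s exact API; get_task_dict and evaluate are the same lm-evaluation-harness entry points the existing text eval already uses.

# In eval.py -- sketch of Approach A. The _VLMEvalWrapper constructor
# arguments below are assumed; check torchtune for the real signature.
from lm_eval.evaluator import evaluate
from lm_eval.tasks import get_task_dict
from torchtune.recipes.eleuther_eval import _VLMEvalWrapper

def run_multimodal_eval(model, transform, tasks, device, limit=None):
   wrapper = _VLMEvalWrapper(model, transform=transform, device=device)
   # Depending on the lm_eval version, get_task_dict may also need a TaskManager.
   task_dict = get_task_dict(tasks)  # e.g. tasks = ["mmmu_val_science"]
   eval_results = evaluate(wrapper, task_dict, limit=limit)
   return eval_results["results"]  # maps task name -> {metric: value}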

The pros and cons of this solution are discussed in the “Alternatives Discussion” section below. This solution should be the one to start with, given how quickly it can enable multimodal eval on torchchat. If for some unforeseen reason it doesn’t work, fall back to the following approach, which requires more work.

Approach B: copy the implementation of HFMultimodalLM from torchtune

  1. Create a wrapper class that subclasses HFMultimodalLM, which is an abstract Hugging Face model class for multimodal models. The implementation of this class can be copied from torchtune, code pointer.
  2. Then call evaluate() with this wrapper class passed in.

Pseudocode:

# In eval.py
from lm_eval.models.hf_vlms import HFMultimodalLM
from lm_eval.evaluator import evaluate

class VLMEvalWrapper(HFMultimodalLM):
   ...  # implementation, copied/adapted from torchtune’s _VLMEvalWrapper

if model is text-based:
   do the existing text-based model eval
elif model is text-image-based:
   eval_results = evaluate(VLMEvalWrapper(...))

The Commandline Arguments

The user command should be python torchchat.py eval llama3.2-11b plus some optional arguments.

In terms of implementation, reuse the same CLI entry point as the text eval: torchchat.py, eval.py. Then, in def eval(), use an if-else to decide which eval wrapper (GPTFastEvalWrapper or the new VLMEvalWrapper) to use based on the model type, as sketched below.
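
A minimal sketch of that if-else, assuming Approach A is used for the multimodal path; the is_multimodal() helper and the constructor arguments are illustrative assumptions rather than torchchat’s actual API (GPTFastEvalWrapper already exists in eval.py):

# In eval.py (sketch): pick the eval wrapper based on the model type.
def build_eval_wrapper(model, tokenizer, device, max_seq_length):
   if is_multimodal(model):
      # Approach A: reuse torchtune's wrapper for image-text models.
      from torchtune.recipes.eleuther_eval import _VLMEvalWrapper
      return _VLMEvalWrapper(model, transform=tokenizer, device=device)
   # Existing text-only path; GPTFastEvalWrapper is already defined in eval.py.
   return GPTFastEvalWrapper(model, tokenizer, max_seq_length, device=device)

def is_multimodal(model) -> bool:
   # Hypothetical check: Llama 3.2-11b uses a vision ("flamingo"-style)
   # architecture; torchchat's real model-type check may differ.
   return getattr(getattr(model, "config", None), "model_type", None) == "flamingo"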

Alternatives Discussion

Pros and cons of importing torchtune’s implementation directly

Pros:

  1. Easy to implement because it’s just an import
  2. Consistency between torchchat and torchtune
  3. Easy maintenance for us
  4. Torchtune has a better relationship with EleutherAI

Cons:

  1. Hard to customize the implementation for torchchat’s needs
  2. For some models, we use model definitions that are different from torchtune’s
  3. We rely on compatibility being maintained on their side
  4. It increases our dependency on torchtune

Testing & Tooling Plan

Run the command python torchchat.py eval llama3.2-11b with different parameter combinations.

The expected output is the tasks that have been run, their scores and the time it took to run each task.
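
One way to automate these spot checks is a small parametrized smoke test that shells out to the CLI, sketched below. The flag combinations (--tasks, --limit, --device) and the task names are assumptions about torchchat’s eval arguments and about which multimodal tasks the harness exposes; adjust them to the real interface.

# test_multimodal_eval.py -- sketch of a CLI smoke test.
import subprocess
import sys

import pytest

# Assumed flag/task combinations; replace with torchchat's real eval arguments.
PARAM_SETS = [
   ["--tasks", "mmmu_val_science", "--limit", "5"],
   ["--tasks", "mmmu_val_art_and_design", "--limit", "5", "--device", "cuda"],
]

@pytest.mark.parametrize("extra_args", PARAM_SETS)
def test_eval_llama32_11b(extra_args):
   cmd = [sys.executable, "torchchat.py", "eval", "llama3.2-11b", *extra_args]
   result = subprocess.run(cmd, capture_output=True, text=True, timeout=3600)
   assert result.returncode == 0, result.stderr
   # The printed report should mention the tasks that were run.
   assert "mmmu" in result.stdout.lower()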

Metadata

Labels

Llama 3.2 - Multimodal (Issues related to Multimodal of Llama 3.2), actionable (Items in the backlog waiting for an appropriate impl/fix), enhancement (New feature or request), good first issue (Good for newcomers), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
