Commit 00657e6

Authored by kcz358, cocoshe, Bo Li, gagan3012, and CaraJ7
Add wild vision from public (#131)
* fix doc
* [WIP] adding mmbench dev evaluation (#75)
  * WIP
  * Update GPT evaluation model name and sys prompt
  * 🛠️ Scale accuracy to percentage

    The accuracy value is now multiplied by 100 in the aggregation function to represent it as a percentage. Regarding the evaluation process, `math` module importation and refactoring reduce progress log verbosity by logging every 100 evaluations instead of 10. It prevents potential logging overflow. Handling of NaN values is added to ensure 'default_value' is set in case of missing data, avoiding errors in split, category, and l2-category assignments. Finally, reporting of categorical and l2-categorical accuracies is streamlined through a new `calculate_hit_rates` function, improving code readability and maintenance.

    Issue refs: #1427, #1533
  * Update GPT evaluation model name and API configuration
  * Refactor MMBench_Evaluator class to handle missing columns
  * Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations
  * Refactor MMBench-CN and MMBench-EN evaluation functions
  * 🔄 Refactor result processing and logging logic

    - Simplified the result processing functions across different utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. Now, all options ("A" to "E") are dynamically added to the result data, and default to "nan" if not provided in the document.
    - Removed redundant keys directly from the process results dict creation to avoid clutter and align with the new dynamic addition of options.
    - In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
    - Commented-out code and verbose logging during evaluation, which may have interfered with performance, has been removed for a more efficient and less intrusive logging experience.

    This cleanup reduces redundancy in the codebase and improves evaluation performance.

    Refs #2045

  ---------

  Co-authored-by: Bo Li <[email protected]>
  (cherry picked from commit a19278c)
* Create README.md
* Add files via upload
* Add MathVerse
* Fix typo in qwen_vl that was causing "reference before assignment"
* convert contexts to list if necessary and remove unnecessary construction of `questions`
* refactor query construction for clarity
* Create ScreenSpot on clean branch
* Update README to reflect new tasks
* Add README file specific to ScreenSpot
* slight update
* Init webSRC
* Draft README for WebSRC
* Update main README with new task names
* Draft and validate websrc eval on dev split
* Add code to enable compilation of submission for WebSRC test split
* Bugfix: WebSRC should be token-level F1 NOT character-level
* Add qwen vl api
* Fix llava conv template for llama3
* Fix llava_hf generation for 1.6
* Parse result for llava_hf 1.6
* Add model_name parameter to Llava constructor
* Fix endless warning for llava_hf generation
* Fix llava_hf image tokens number issue
* Create LICENSE
* Update LICENSE
* Update LICENSE
* Better task list_with_num
* Fix idefics2 llava in the wild bugs
* Remove redundant code in fuyu
* Fix instructblip qformer size mismatch and multi-images problem
* Comment out parse result in xcomposer
* Comment out Spice in caption task so that don't need to download stanford nlp model
* Update gitignore
* Add separated pope tasks by category
* Fix pope random name in pope full
* Set printing info for llava_hf to debug level
* Adding Phi3v model.
* Adding prompt arguments for Phi3v on MathVista-TestMini
* Adding documentation of Phi3v class.
* [Fix] import issues of multilingual llava and olympiadbench
* fix compatibility issue of older version llava
* add upd
* add upd
* add upd
* add upd
* add upd
* add upd
* Group MMMU images into one image (#83)
  * update
  * update font
  * Add matplotlib.font_manager import in utils.py
  * Refactor font handling in add_order_label function in utils.py
  * group mmmu

  ---------

  Co-authored-by: Li Bo <[email protected]>
* merge model_specific_prompt_kwargs and dataset_name into each task yaml
* Add MathVerse in README.md
* slightly change query_prompt for the reproduction
* update utils.py for leaderboard submission
* add conbench
* update README
* Update README.md
* init include vcr
* modify the form of VCR
* switch logic
* add crossed_text to vcr_wiki output
* include the try-except logic for spacy
* update vcr_wiki tasks
* update vcr_wiki tasks in README.md
* include std and confidence interval
* update gpt-3.5-turbo version
* update gpt-3.5-turbo version
* chore: Remove unnecessary files and code related to live_bench and sft_eval tasks
* Bump version to 0.2.0.dev0
* chore: Update lmms-eval to support video evaluations for LLaVA models
* Update llava conv_template in lmms_eval/models/llava.py
* Update image alignment in README.md
* chore: Update lmms-eval to support video evaluations for LLaVA models
* chore: Update lmms-eval to support video evaluations for LLaVA models
* Update README.md
* Update README.md
* update aggregation function for vcr_wiki
* update README.md
* Update README.md
* update version
* add II-Bench
* fix dataset_path
* Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image
* add tinyllava
* LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)
* fix #117, allow auto download with tar format videos
* fix #117, allow auto download with tar format videos
* fix typo
* feat: Add support for auto downloading tar format videos
* Release llava-wilder
* chore: Update dependencies to fix potential risks and improve compatibility
* tutorial
* docs
* update preparation
* small fix
* small fix
* lint
* to sh script
* update readme
* Remove handling non-visual loop in llava
* Add llava_hf back to registry
* Update README.md
* Update README.md
* update ablation for videomme datasets
* chore: Handle ImportError when importing models

  Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.
* chore: Remove unused models from lmms_eval package
* feat: Allow loading model configurations from other packages
* feat: Allow including external tasks from plugins
* chore: Add loguru for logging in lmms_eval package
* Remove unnecessary lines since use batched visuals now in llava
* Add longva
* Revise model registry for llava_hf and longva
* Delete unnecessary lines
* Remove unnecessary lines for video llava
* Update pyproject.toml
* Update activitynetqa_generation.yaml
* Fix vid mme post prompt issue
* Add wild vision 0617
* Hardcode to keep image for wild vision
* Fixing scoring logic
* Fixing dataset name
* Fixing handling None filtered score

---------

Co-authored-by: cocoshe <[email protected]>
Co-authored-by: Bo Li <[email protected]>
Co-authored-by: Gagan Bhatia <[email protected]>
Co-authored-by: CaraJ7 <[email protected]>
Co-authored-by: Li Bo <[email protected]>
Co-authored-by: Andrea Tupini <[email protected]>
Co-authored-by: Hunter Heidenreich <[email protected]>
Co-authored-by: Victor Fragoso <[email protected]>
Co-authored-by: AtsuMiyai <[email protected]>
Co-authored-by: Pu Fanyi <[email protected]>
Co-authored-by: Yuan Zhang <[email protected]>
Co-authored-by: Yuan Zhang <[email protected]>
Co-authored-by: tianyu-z <[email protected]>
Co-authored-by: Suyuchen <[email protected]>
Co-authored-by: XinrunDu <[email protected]>
Co-authored-by: teowu <[email protected]>
Co-authored-by: Jingyang <[email protected]>
Co-authored-by: Teo (Timothy) Wu Haoning <[email protected]>
Co-authored-by: choiszt <[email protected]>
Co-authored-by: Lorenzo Mammana <[email protected]>
1 parent dfaa4c2 commit 00657e6

File tree

10 files changed: 266 additions, 3 deletions


lmms_eval/__main__.py

Lines changed: 7 additions & 0 deletions
@@ -1,3 +1,4 @@
+import importlib
 import os
 import yaml
 import sys
@@ -236,6 +237,12 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
         eval_logger.info(f"Including path: {args.include_path}")
         include_path(args.include_path)

+    if os.environ.get("LMMS_EVAL_PLUGINS", None):
+        for plugin in os.environ["LMMS_EVAL_PLUGINS"].split(","):
+            package_tasks_location = importlib.util.find_spec(f"{plugin}.tasks").submodule_search_locations[0]
+            eval_logger.info(f"Including path: {args.include_path}")
+            include_path(package_tasks_location)
+
     if args.tasks is None:
         task_names = ALL_TASKS
     elif args.tasks == "list":
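The hunk above lets cli_evaluate_single pull task definitions from any package named in the LMMS_EVAL_PLUGINS environment variable: it resolves the package's tasks directory with importlib.util.find_spec and hands it to include_path. A minimal sketch of how an external plugin could satisfy this contract follows; the package name my_lmms_plugin and its layout are illustrative, not part of this commit.

# Hypothetical plugin layout (names are illustrative):
#
#   my_lmms_plugin/
#       __init__.py
#       tasks/
#           __init__.py
#           my_task/
#               my_task.yaml
#               utils.py
#
# With the package importable and LMMS_EVAL_PLUGINS=my_lmms_plugin exported,
# the loop added above resolves the tasks directory roughly like this:

import importlib.util

spec = importlib.util.find_spec("my_lmms_plugin.tasks")  # assumes the plugin package is installed
package_tasks_location = spec.submodule_search_locations[0]
print(package_tasks_location)  # this directory is what gets passed to include_path(...)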

lmms_eval/evaluator.py

Lines changed: 6 additions & 1 deletion
@@ -325,7 +325,12 @@ def evaluate(
         # hack: remove image columns to speed avoid loading images and speed up postprocessing
         # reason: doc_iterator will actually load image if it's in the doc.
         docs = task.test_docs() if task.has_test_docs() else task.validation_docs()
-        if "d170" not in task_name and "dc100" not in task_name and "dc200" not in task_name and "llava_wilder" not in task_name and "live_bench" not in task_name:
+        if "d170" not in task_name \
+            and "dc100" not in task_name \
+            and "dc200" not in task_name \
+            and "llava_wilder" not in task_name \
+            and "livebench" not in task_name \
+            and "wildvision" not in task_name:
             remove_cols = []
             features = docs.features
             # If it is an Image instance or a Sequence of Image instance. Remove it
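The widened condition keeps image columns for tasks whose process_results needs the raw image (WildVision, for instance, re-encodes it for the GPT judge), while every other task still drops image features before iterating. A rough sketch of the column-removal idea that follows this condition, assuming a Hugging Face datasets.Dataset named docs, could look like this:

# Sketch only: drop Image columns (or Sequences of Images) so that iterating
# over the docs does not decode the images. Feature classes come from the
# `datasets` library; `docs` is assumed to be a datasets.Dataset.
from datasets import Dataset, Image, Sequence

def strip_image_columns(docs: Dataset) -> Dataset:
    remove_cols = []
    features = docs.features
    for col, feature in features.items():
        if isinstance(feature, Image) or (isinstance(feature, Sequence) and isinstance(feature.feature, Image)):
            remove_cols.append(col)
    return docs.remove_columns(remove_cols) if remove_cols else docs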

lmms_eval/models/__init__.py

Lines changed: 19 additions & 0 deletions
@@ -1,3 +1,6 @@
+import importlib
+import os
+import hf_transfer
 from loguru import logger
 import sys

@@ -28,6 +31,8 @@
     "mplug_owl_video": "mplug_Owl",
     "phi3v": "Phi3v",
     "tinyllava": "TinyLlava",
+    "llava_hf": "LlavaHf",
+    "longva": "LongVA",
     "llava_onevision": "Llava_OneVision",
     "llava_hf": "LlavaHf",
     "longva": "LongVA",
@@ -39,3 +44,17 @@
     except ImportError as e:
         # logger.warning(f"Failed to import {model_class} from {model_name}: {e}")
         pass
+
+if os.environ.get("LMMS_EVAL_PLUGINS", None):
+    # Allow specifying other packages to import models from
+    for plugin in os.environ["LMMS_EVAL_PLUGINS"].split(","):
+        m = importlib.import_module(f"{plugin}.models")
+        for model_name, model_class in getattr(m, "AVAILABLE_MODELS").items():
+            try:
+                exec(f"from {plugin}.models.{model_name} import {model_class}")
+            except ImportError:
+                pass
+
+import hf_transfer
+
+os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
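Mirroring the task plugin hook, the block above imports model classes from any package listed in LMMS_EVAL_PLUGINS, expecting that package to expose an AVAILABLE_MODELS mapping in its models module. A hedged sketch of the plugin side (package, module, and class names are all hypothetical):

# my_lmms_plugin/models/__init__.py  (hypothetical plugin package)
# Maps module name -> class name, matching what the loop above reads via
# importlib.import_module("my_lmms_plugin.models").AVAILABLE_MODELS
AVAILABLE_MODELS = {
    "my_vlm": "MyVLM",  # resolved as: from my_lmms_plugin.models.my_vlm import MyVLM
}

# my_lmms_plugin/models/my_vlm.py would then define MyVLM, typically a subclass
# of lmms_eval's model base class registered under the name "my_vlm" so it can
# be selected from the CLI like the built-in models.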

lmms_eval/models/longva.py

Lines changed: 1 addition & 1 deletion
@@ -458,4 +458,4 @@ def _collate(x):
         res = re_ords.get_original(res)

         pbar.close()
-        return res
+        return res

lmms_eval/tasks/activitynetqa/activitynetqa_generation.yaml

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-dataset_name: "Generation"
 task: "activitynetqa"
 test_split: test
 output_type: generate_until

lmms_eval/tasks/videomme/videomme_w_subtitle.yaml

Lines changed: 5 additions & 0 deletions
@@ -26,6 +26,11 @@ metric_list:
 model_specific_prompt_kwargs:
   default:
     frame_num: 32
+<<<<<<< HEAD
+    pre_prompt: ""
+    post_prompt: "\nAnswer the question using a single word or phrase."
+=======
+>>>>>>> internal_main_dev
   gemini_api:
     gemini_api_flag: "full subtitle"
 # gpt4v:

lmms_eval/tasks/websrc/utils.py

Lines changed: 8 additions & 0 deletions
@@ -63,7 +63,15 @@ def websrc_test_aggregate_results_for_submission(results, args):
         for result in results:
             out.update(result)
         json.dump(out, f, indent=4)
+<<<<<<< HEAD
+<<<<<<< HEAD
     eval_logger.info(f"Results saved to {path}.")
+=======
+    lmms_logger.info(f"Results saved to {path}.")
+>>>>>>> internal_main_dev
+=======
+    eval_logger.info(f"Results saved to {path}.")
+>>>>>>> internal_main_dev


 def websrc_aggregate_results(results):
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+dataset_path: WildVision/wildvision-arena-data
+dataset_kwargs:
+  token: True
+output_type: generate_until
+doc_to_visual: !function utils.wild_vision_doc_to_visual
+doc_to_text: !function utils.wild_vision_doc_to_text
+doc_to_target: !function utils.wild_vision_doc_to_target
+generation_kwargs:
+  max_new_tokens: 4096
+  temperature: 0
+  top_p: 1.0
+  num_beams: 1
+  do_sample: false
+# The return value of process_results will be used by metrics
+process_results: !function utils.wild_vision_process_results
+# Note that the metric name can be either a registed metric function (such as the case for GQA) or a key name returned by process_results
+metric_list:
+  - metric: gpt_eval_score
+    aggregation: !function utils.wild_vision_aggregation
+    higher_is_better: true
+metadata:
+  judge_model: gpt-4o
+  baseline_model: claude-3-sonnet-20240229
Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
+import json
+import re
+import os
+import requests
+import numpy as np
+import time
+import yaml
+from pathlib import Path
+from copy import deepcopy
+from io import BytesIO
+import base64
+
+from loguru import logger as eval_logger
+
+NUM_SECONDS_TO_SLEEP = 5
+
+
+with open(Path(__file__).parent / "_default_template_yaml", "r") as f:
+    raw_data = f.readlines()
+    safe_data = []
+    for i, line in enumerate(raw_data):
+        # remove function definition since yaml load cannot handle it
+        if "!function" not in line:
+            safe_data.append(line)
+
+    config = yaml.safe_load("".join(safe_data))
+
+GPT_EVAL_MODEL_NAME = config["metadata"]["judge_model"]
+BASELINE_MODEL_NAME = config["metadata"]["baseline_model"]
+
+API_TYPE = os.getenv("API_TYPE", "openai")
+
+if API_TYPE == "openai":
+    API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
+    API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
+    headers = {
+        "Authorization": f"Bearer {API_KEY}",
+        "Content-Type": "application/json",
+    }
+elif API_TYPE == "azure":
+    API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken")
+    API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY")
+    headers = {
+        "api-key": API_KEY,
+        "Content-Type": "application/json",
+    }
+
+system_prompt = """\
+Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.
+
+Begin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.
+
+When evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.
+
+Then consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.
+
+Then consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.
+
+After providing your explanation, you must output only one of the following choices as your final verdict with a label:
+
+1. Assistant A is significantly better: [[A>>B]]
+2. Assistant A is slightly better: [[A>B]]
+3. Tie, relatively the same: [[A=B]]
+4. Assistant B is slightly better: [[B>A]]
+5. Assistant B is significantly better: [[B>>A]]
+
+Example output: "My final verdict is tie: [[A=B]]".\
+"""
+
+prompt_template = "<|User Prompt|>\n{question_1}\n\n<|The Start of Assistant A's Answer|>\n{answer_1}\n<|The End of Assistant A's Answer|>\n\n<|The Start of Assistant B's Answer|>\n{answer_2}\n<|The End of Assistant B's Answer|>"
+
+def get_chat_response(base64_image, prompt, max_retries=5, wait_time=10):
+    headers = {
+        "Authorization": f"Bearer {API_KEY}",
+        "Content-Type": "application/json",
+    }
+
+    payload = {
+        "model": GPT_EVAL_MODEL_NAME,
+        "messages": [
+            {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": prompt},
+                    {"type": "image_url",
+                     "image_url" : {
+                         "url" : f"data:image/jpeg;base64, {base64_image}"
+                     }
+                    },
+                ],
+            }
+        ],
+        "max_tokens": 1024,
+        "temperature": 0.0,
+    }
+
+    for attempt in range(max_retries):
+        try:
+            response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
+            response.raise_for_status()
+            response_data = response.json()
+            return response_data["choices"][0]["message"]["content"], GPT_EVAL_MODEL_NAME
+        except requests.exceptions.RequestException as e:
+            print(f"Request failed on attempt {attempt+1}: {e}")
+            if attempt == max_retries - 1:
+                print(f"Failed to get response after {max_retries} attempts")
+                return "", GPT_EVAL_MODEL_NAME
+        except Exception as e:
+            print(f"Error on attempt {attempt+1}: {e}")
+            return "", GPT_EVAL_MODEL_NAME
+
+
+
+def image_to_base64(pil_image):
+    buffered = BytesIO()
+    pil_image.save(buffered, format="PNG")
+    return base64.b64encode(buffered.getvalue()).decode("utf-8")
+
+def get_score(judgement, pattern, pairwise=True):
+    matches = pattern.findall(judgement)
+    matches = [m for m in matches if m != ""]
+    if len(set(matches)) == 0:
+        return None, True
+    elif len(set(matches)) == 1:
+        if pairwise:
+            return matches[0].strip("\n"), False
+        return int(matches[0])
+    else:
+        return None, False
+
+def wild_vision_doc_to_visual(doc):
+    return [doc["image"].convert('RGB')]
+
+
+def wild_vision_doc_to_text(doc, model_specific_prompt_kwargs=None):
+    question = doc["instruction"].strip()
+    if "pre_prompt" in model_specific_prompt_kwargs and model_specific_prompt_kwargs["pre_prompt"] != "":
+        question = f"{model_specific_prompt_kwargs['pre_prompt']}{question}"
+    if "post_prompt" in model_specific_prompt_kwargs and model_specific_prompt_kwargs["post_prompt"] != "":
+        question = f"{question}{model_specific_prompt_kwargs['post_prompt']}"
+    return question
+
+def wild_vision_doc_to_target(doc):
+    return doc[BASELINE_MODEL_NAME]
+
+
+def wild_vision_process_results(doc, results):
+    pred = results[0]
+    user_prompt = prompt_template.format(question_1=doc["instruction"], answer_1=doc[BASELINE_MODEL_NAME], answer_2=pred)
+    base64_image = image_to_base64(doc["image"])
+    resps, gpt_name = get_chat_response(base64_image, user_prompt)
+    score, _ = get_score(resps, pattern=re.compile("\[\[([AB<>=]+)\]\]"))
+
+    if score is None:
+        score = resps
+
+    if "A>B" in score:
+        final_score = -1
+        judgement = "Worse" #Baseline better
+    elif "A>>B" in score:
+        final_score = -2
+        judgement = "Worse++"
+    elif "A=B" in score:
+        final_score = 0
+        judgement = "Tie"
+    elif "B>A" in score:
+        final_score = 1
+        judgement = "Better"
+    elif "B>>A" in score:
+        final_score = 2
+        judgement = "Better++"
+    else:
+        final_score = 0
+        judgement = "Unclear"
+
+
+    return {"gpt_eval_score" : {"question" : doc["instruction"], "score" : final_score, "gpt_resps" : resps, "ans_1" : doc[BASELINE_MODEL_NAME], "ans_2" : pred, "filtered_resps" : score, "judgement" : judgement}}
+
+
+def wild_vision_aggregation(results):
+    score = 0
+    for res in results:
+        score += res["score"]
+
+    return score / len(results)
+
+
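In the file above, the judge's verdict is extracted from the free-form GPT response by get_score with the pattern \[\[([AB<>=]+)\]\] and then mapped onto a signed scale, where A is the claude-3-sonnet baseline and B is the evaluated model: A>>B gives -2, A>B gives -1, A=B gives 0, B>A gives +1, B>>A gives +2; wild_vision_aggregation then averages these per-question scores. A simplified, self-contained sketch of that parsing and mapping (the sample judgement strings are made up for illustration):

import re

# Simplified sketch of the verdict parsing and score mapping used above.
VERDICT_PATTERN = re.compile(r"\[\[([AB<>=]+)\]\]")
VERDICT_TO_SCORE = {"A>>B": -2, "A>B": -1, "A=B": 0, "B>A": 1, "B>>A": 2}

def verdict_score(judgement: str) -> int:
    matches = [m for m in VERDICT_PATTERN.findall(judgement) if m != ""]
    if not matches:
        return 0  # the task labels this case "Unclear"
    return VERDICT_TO_SCORE.get(matches[0], 0)

# Made-up judgement strings, only to illustrate the mapping:
samples = [
    "My final verdict is tie: [[A=B]]",
    "Assistant B answered the question more accurately. [[B>A]]",
    "Assistant A is significantly better: [[A>>B]]",
]
scores = [verdict_score(s) for s in samples]
print(scores, sum(scores) / len(scores))  # [0, 1, -2] and their mean, as in wild_vision_aggregation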
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+task: wildvision_0617
+dataset_name: release_bench_0617_with_modelresponse
+test_split: test500
+output_type: generate_until
+include: _default_template_yaml
+model_specific_prompt_kwargs:
+  default:
+    pre_prompt: ""
+    post_prompt: ""
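This task file selects the test500 split of release_bench_0617_with_modelresponse and leaves both prompt affixes empty, so wild_vision_doc_to_text effectively returns just the stripped instruction. A small sketch of how those defaults flow through the helper defined in the utils above (the doc dict is a made-up stand-in for one dataset row):

# Sketch: effect of the default (empty) pre_prompt/post_prompt from this yaml.
def wild_vision_doc_to_text(doc, model_specific_prompt_kwargs=None):
    question = doc["instruction"].strip()
    if "pre_prompt" in model_specific_prompt_kwargs and model_specific_prompt_kwargs["pre_prompt"] != "":
        question = f"{model_specific_prompt_kwargs['pre_prompt']}{question}"
    if "post_prompt" in model_specific_prompt_kwargs and model_specific_prompt_kwargs["post_prompt"] != "":
        question = f"{question}{model_specific_prompt_kwargs['post_prompt']}"
    return question

fake_doc = {"instruction": "  Describe the chart in one sentence.  "}
print(wild_vision_doc_to_text(fake_doc, {"pre_prompt": "", "post_prompt": ""}))
# -> "Describe the chart in one sentence."  (both affixes are empty, so only .strip() applies)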
