Commit 00657e6

Authored by kcz358, cocoshe, Bo Li, gagan3012, and CaraJ7
Add wild vision from public (#131)
* fix doc
* [WIP] adding mmbench dev evaluation (#75)
  * WIP
  * Update GPT evaluation model name and sys prompt
  * 🛠️ Scale accuracy to percentage

    The accuracy value is now multiplied by 100 in the aggregation function to represent it as a percentage. Regarding the evaluation process, `math` module importation and refactoring reduce progress log verbosity by logging every 100 evaluations instead of 10. It prevents potential logging overflow. Handling of NaN values is added to ensure 'default_value' is set in case of missing data, avoiding errors in split, category, and l2-category assignments. Finally, reporting of categorical and l2-categorical accuracies is streamlined through a new `calculate_hit_rates` function, improving code readability and maintenance.

    Issue refs: #1427, #1533
  * Update GPT evaluation model name and API configuration
  * Refactor MMBench_Evaluator class to handle missing columns
  * Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations
  * Refactor MMBench-CN and MMBench-EN evaluation functions
  * 🔄 Refactor result processing and logging logic

    - Simplified the result processing functions across different utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. Now, all options ("A" to "E") are dynamically added to the result data, and default to "nan" if not provided in the document.
    - Removed redundant keys directly from the process results dict creation to avoid clutter and align with the new dynamic addition of options.
    - In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
    - Commented-out code and verbose logging during evaluation, which may have interfered with performance, has been removed for a more efficient and less intrusive logging experience.

    This cleanup reduces redundancy in the codebase and improves evaluation performance.

    Refs #2045

  ---------

  Co-authored-by: Bo Li <[email protected]>
  (cherry picked from commit a19278c)
* Create README.md
* Add files via upload
* Add MathVerse
* Fix typo in qwen_vl that was causing "reference before assignment"
* convert contexts to list if necessary and remove unnecessary construction of `questions`
* refactor query construction for clarity
* Create ScreenSpot on clean branch
* Update README to reflect new tasks
* Add README file specific to ScreenSpot
* slight update
* Init webSRC
* Draft README for WebSRC
* Update main README with new task names
* Draft and validate websrc eval on dev split
* Add code to enable compilation of submission for WebSRC test split
* Bugfix: WebSRC should be token-level F1 NOT character-level
* Add qwen vl api
* Fix llava conv template for llama3
* Fix llava_hf generation for 1.6
* Parse result for llava_hf 1.6
* Add model_name parameter to Llava constructor
* Fix endless warning for llava_hf generation
* Fix llava_hf image tokens number issue
* Create LICENSE
* Update LICENSE
* Update LICENSE
* Better task list_with_num
* Fix idefics2 llava in the wild bugs
* Remove redundant code in fuyu
* Fix instructblip qformer size mismatch and multi-images problem
* Comment out parse result in xcomposer
* Comment out Spice in caption task so that don't need to download stanford nlp model
* Update gitignore
* Add separated pope tasks by category
* Fix pope random name in pope full
* Set printing info for llava_hf to debug level
* Adding Phi3v model.
* Adding prompt arguments for Phi3v on MathVista-TestMini
* Adding documentation of Phi3v class.
* [Fix] import issues of multilingual llava and olympiadbench
* fix compatibility issue of older version llava
* add upd
* add upd
* add upd
* add upd
* add upd
* add upd
* Group MMMU images into one image (#83)
  * update
  * update font
  * Add matplotlib.font_manager import in utils.py
  * Refactor font handling in add_order_label function in utils.py
  * group mmmu

  ---------

  Co-authored-by: Li Bo <[email protected]>
* merge model_specific_prompt_kwargs and dataset_name into each task yaml
* Add MathVerse in README.md
* slightly change query_prompt for the reproduction
* update utils.py for leaderboard submission
* add conbench
* update README
* Update README.md
* init include vcr
* modify the form of VCR
* switch logic
* add crossed_text to vcr_wiki output
* include the try-except logic for spacy
* update vcr_wiki tasks
* update vcr_wiki tasks in README.md
* include std and confidence interval
* update gpt-3.5-turbo version
* update gpt-3.5-turbo version
* chore: Remove unnecessary files and code related to live_bench and sft_eval tasks
* Bump version to 0.2.0.dev0
* chore: Update lmms-eval to support video evaluations for LLaVA models
* Update llava conv_template in lmms_eval/models/llava.py
* Update image alignment in README.md
* chore: Update lmms-eval to support video evaluations for LLaVA models
* chore: Update lmms-eval to support video evaluations for LLaVA models
* Update README.md
* Update README.md
* update aggregation function for vcr_wiki
* update README.md
* Update README.md
* update version
* add II-Bench
* fix dataset_path
* Add qbench, qbench2, abench; fix phi3v as its current implementation does not support multi-image
* add tinyllava
* LongVideoBench support: image LMMs (idefics2, phi3) and video LMMs (LLaVA-Next-Video-34B)
* fix #117, allow auto download with tar format videos
* fix #117, allow auto download with tar format videos
* fix typo
* feat: Add support for auto downloading tar format videos
* Release llava-wilder
* chore: Update dependencies to fix potential risks and improve compatibility
* tutorial
* docs
* update preparation
* small fix
* small fix
* lint
* to sh script
* update readme
* Remove handling non-visual loop in llava
* Add llava_hf back to registry
* Update README.md
* Update README.md
* update ablation for videomme datasets
* chore: Handle ImportError when importing models

  Handle the ImportError exception when importing models in the lmms_eval package. This change adds a try-except block to catch the ImportError and print an error message indicating the failed import. This will help with troubleshooting and identifying any issues with the model imports.
* chore: Remove unused models from lmms_eval package
* feat: Allow loading model configurations from other packages
* feat: Allow including external tasks from plugins
* chore: Add loguru for logging in lmms_eval package
* Remove unnecessary lines since use batched visuals now in llava
* Add longva
* Revise model registry for llava_hf and longva
* Delete unnecessary lines
* Remove unnecessary lines for video llava
* Update pyproject.toml
* Update activitynetqa_generation.yaml
* Fix vid mme post prompt issue
* Add wild vision 0617
* Hardcode to keep image for wild vision
* Fixing scoring logic
* Fixing dataset name
* Fixing handling None filtered score

---------

Co-authored-by: cocoshe <[email protected]>
Co-authored-by: Bo Li <[email protected]>
Co-authored-by: Gagan Bhatia <[email protected]>
Co-authored-by: CaraJ7 <[email protected]>
Co-authored-by: Li Bo <[email protected]>
Co-authored-by: Andrea Tupini <[email protected]>
Co-authored-by: Hunter Heidenreich <[email protected]>
Co-authored-by: Victor Fragoso <[email protected]>
Co-authored-by: AtsuMiyai <[email protected]>
Co-authored-by: Pu Fanyi <[email protected]>
Co-authored-by: Yuan Zhang <[email protected]>
Co-authored-by: Yuan Zhang <[email protected]>
Co-authored-by: tianyu-z <[email protected]>
Co-authored-by: Suyuchen <[email protected]>
Co-authored-by: XinrunDu <[email protected]>
Co-authored-by: teowu <[email protected]>
Co-authored-by: Jingyang <[email protected]>
Co-authored-by: Teo (Timothy) Wu Haoning <[email protected]>
Co-authored-by: choiszt <[email protected]>
Co-authored-by: Lorenzo Mammana <[email protected]>
1 parent dfaa4c2 commit 00657e6

File tree

10 files changed: 266 additions, 3 deletions


lmms_eval/__main__.py

Lines changed: 7 additions & 0 deletions
@@ -1,3 +1,4 @@
+import importlib
 import os
 import yaml
 import sys
@@ -236,6 +237,12 @@ def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> None:
         eval_logger.info(f"Including path: {args.include_path}")
         include_path(args.include_path)

+    if os.environ.get("LMMS_EVAL_PLUGINS", None):
+        for plugin in os.environ["LMMS_EVAL_PLUGINS"].split(","):
+            package_tasks_location = importlib.util.find_spec(f"{plugin}.tasks").submodule_search_locations[0]
+            eval_logger.info(f"Including path: {args.include_path}")
+            include_path(package_tasks_location)
+
     if args.tasks is None:
         task_names = ALL_TASKS
     elif args.tasks == "list":
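The hunk above lets cli_evaluate_single pull task definitions from any package named in the LMMS_EVAL_PLUGINS environment variable: it resolves the package's tasks directory with importlib.util.find_spec and hands it to include_path. A minimal sketch of how an external plugin could satisfy this contract follows; the package name my_lmms_plugin and its layout are illustrative, not part of this commit.

# Hypothetical plugin layout (names are illustrative):
#
#   my_lmms_plugin/
#       __init__.py
#       tasks/
#           __init__.py
#           my_task/
#               my_task.yaml
#               utils.py
#
# With the package importable and LMMS_EVAL_PLUGINS=my_lmms_plugin exported,
# the loop added above resolves the tasks directory roughly like this:

import importlib.util

spec = importlib.util.find_spec("my_lmms_plugin.tasks")  # assumes the plugin package is installed
package_tasks_location = spec.submodule_search_locations[0]
print(package_tasks_location)  # this directory is what gets passed to include_path(...)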

lmms_eval/evaluator.py

Lines changed: 6 additions & 1 deletion
@@ -325,7 +325,12 @@ def evaluate(
         # hack: remove image columns to speed avoid loading images and speed up postprocessing
         # reason: doc_iterator will actually load image if it's in the doc.
         docs = task.test_docs() if task.has_test_docs() else task.validation_docs()
-        if "d170" not in task_name and "dc100" not in task_name and "dc200" not in task_name and "llava_wilder" not in task_name and "live_bench" not in task_name:
+        if "d170" not in task_name \
+            and "dc100" not in task_name \
+            and "dc200" not in task_name \
+            and "llava_wilder" not in task_name \
+            and "livebench" not in task_name \
+            and "wildvision" not in task_name:
             remove_cols = []
             features = docs.features
             # If it is an Image instance or a Sequence of Image instance. Remove it
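The widened condition keeps image columns for tasks whose process_results needs the raw image (WildVision, for instance, re-encodes it for the GPT judge), while every other task still drops image features before iterating. A rough sketch of the column-removal idea that follows this condition, assuming a Hugging Face datasets.Dataset named docs, could look like this:

# Sketch only: drop Image columns (or Sequences of Images) so that iterating
# over the docs does not decode the images. Feature classes come from the
# `datasets` library; `docs` is assumed to be a datasets.Dataset.
from datasets import Dataset, Image, Sequence

def strip_image_columns(docs: Dataset) -> Dataset:
    remove_cols = []
    features = docs.features
    for col, feature in features.items():
        if isinstance(feature, Image) or (isinstance(feature, Sequence) and isinstance(feature.feature, Image)):
            remove_cols.append(col)
    return docs.remove_columns(remove_cols) if remove_cols else docs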

lmms_eval/models/__init__.py

Lines changed: 19 additions & 0 deletions
@@ -1,3 +1,6 @@
+import importlib
+import os
+import hf_transfer
 from loguru import logger
 import sys

@@ -28,6 +31,8 @@
     "mplug_owl_video": "mplug_Owl",
     "phi3v": "Phi3v",
     "tinyllava": "TinyLlava",
+    "llava_hf": "LlavaHf",
+    "longva": "LongVA",
     "llava_onevision": "Llava_OneVision",
     "llava_hf": "LlavaHf",
     "longva": "LongVA",
@@ -39,3 +44,17 @@
     except ImportError as e:
         # logger.warning(f"Failed to import {model_class} from {model_name}: {e}")
         pass
+
+if os.environ.get("LMMS_EVAL_PLUGINS", None):
+    # Allow specifying other packages to import models from
+    for plugin in os.environ["LMMS_EVAL_PLUGINS"].split(","):
+        m = importlib.import_module(f"{plugin}.models")
+        for model_name, model_class in getattr(m, "AVAILABLE_MODELS").items():
+            try:
+                exec(f"from {plugin}.models.{model_name} import {model_class}")
+            except ImportError:
+                pass
+
+import hf_transfer
+
+os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
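Mirroring the task plugin hook, the block above imports model classes from any package listed in LMMS_EVAL_PLUGINS, expecting that package to expose an AVAILABLE_MODELS mapping in its models module. A hedged sketch of the plugin side (package, module, and class names are all hypothetical):

# my_lmms_plugin/models/__init__.py  (hypothetical plugin package)
# Maps module name -> class name, matching what the loop above reads via
# importlib.import_module("my_lmms_plugin.models").AVAILABLE_MODELS
AVAILABLE_MODELS = {
    "my_vlm": "MyVLM",  # resolved as: from my_lmms_plugin.models.my_vlm import MyVLM
}

# my_lmms_plugin/models/my_vlm.py would then define MyVLM, typically a subclass
# of lmms_eval's model base class registered under the name "my_vlm" so it can
# be selected from the CLI like the built-in models.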

lmms_eval/models/longva.py

Lines changed: 1 addition & 1 deletion
@@ -458,4 +458,4 @@ def _collate(x):
         res = re_ords.get_original(res)

         pbar.close()
-        return res
+        return res

lmms_eval/tasks/activitynetqa/activitynetqa_generation.yaml

Lines changed: 0 additions & 1 deletion
@@ -1,4 +1,3 @@
-dataset_name: "Generation"
 task: "activitynetqa"
 test_split: test
 output_type: generate_until

lmms_eval/tasks/videomme/videomme_w_subtitle.yaml

Lines changed: 5 additions & 0 deletions
@@ -26,6 +26,11 @@ metric_list:
 model_specific_prompt_kwargs:
   default:
     frame_num: 32
+<<<<<<< HEAD
+    pre_prompt: ""
+    post_prompt: "\nAnswer the question using a single word or phrase."
+=======
+>>>>>>> internal_main_dev
   gemini_api:
     gemini_api_flag: "full subtitle"
 # gpt4v:

lmms_eval/tasks/websrc/utils.py

Lines changed: 8 additions & 0 deletions
@@ -63,7 +63,15 @@ def websrc_test_aggregate_results_for_submission(results, args):
         for result in results:
             out.update(result)
         json.dump(out, f, indent=4)
+<<<<<<< HEAD
+<<<<<<< HEAD
     eval_logger.info(f"Results saved to {path}.")
+=======
+    lmms_logger.info(f"Results saved to {path}.")
+>>>>>>> internal_main_dev
+=======
+    eval_logger.info(f"Results saved to {path}.")
+>>>>>>> internal_main_dev


 def websrc_aggregate_results(results):
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+dataset_path: WildVision/wildvision-arena-data
+dataset_kwargs:
+  token: True
+output_type: generate_until
+doc_to_visual: !function utils.wild_vision_doc_to_visual
+doc_to_text: !function utils.wild_vision_doc_to_text
+doc_to_target: !function utils.wild_vision_doc_to_target
+generation_kwargs:
+  max_new_tokens: 4096
+  temperature: 0
+  top_p: 1.0
+  num_beams: 1
+  do_sample: false
+# The return value of process_results will be used by metrics
+process_results: !function utils.wild_vision_process_results
+# Note that the metric name can be either a registed metric function (such as the case for GQA) or a key name returned by process_results
+metric_list:
+  - metric: gpt_eval_score
+    aggregation: !function utils.wild_vision_aggregation
+    higher_is_better: true
+metadata:
+  judge_model: gpt-4o
+  baseline_model: claude-3-sonnet-20240229
Lines changed: 188 additions & 0 deletions
@@ -0,0 +1,188 @@
+import json
+import re
+import os
+import requests
+import numpy as np
+import time
+import yaml
+from pathlib import Path
+from copy import deepcopy
+from io import BytesIO
+import base64
+
+from loguru import logger as eval_logger
+
+NUM_SECONDS_TO_SLEEP = 5
+
+
+with open(Path(__file__).parent / "_default_template_yaml", "r") as f:
+    raw_data = f.readlines()
+    safe_data = []
+    for i, line in enumerate(raw_data):
+        # remove function definition since yaml load cannot handle it
+        if "!function" not in line:
+            safe_data.append(line)
+
+    config = yaml.safe_load("".join(safe_data))
+
+GPT_EVAL_MODEL_NAME = config["metadata"]["judge_model"]
+BASELINE_MODEL_NAME = config["metadata"]["baseline_model"]
+
+API_TYPE = os.getenv("API_TYPE", "openai")
+
+if API_TYPE == "openai":
+    API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
+    API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
+    headers = {
+        "Authorization": f"Bearer {API_KEY}",
+        "Content-Type": "application/json",
+    }
+elif API_TYPE == "azure":
+    API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken")
+    API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY")
+    headers = {
+        "api-key": API_KEY,
+        "Content-Type": "application/json",
+    }
+
+system_prompt = """\
+Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.
+
+Begin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.
+
+When evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.
+
+Then consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.
+
+Then consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.
+
+After providing your explanation, you must output only one of the following choices as your final verdict with a label:
+
+1. Assistant A is significantly better: [[A>>B]]
+2. Assistant A is slightly better: [[A>B]]
+3. Tie, relatively the same: [[A=B]]
+4. Assistant B is slightly better: [[B>A]]
+5. Assistant B is significantly better: [[B>>A]]
+
+Example output: "My final verdict is tie: [[A=B]]".\
+"""
+
+prompt_template = "<|User Prompt|>\n{question_1}\n\n<|The Start of Assistant A's Answer|>\n{answer_1}\n<|The End of Assistant A's Answer|>\n\n<|The Start of Assistant B's Answer|>\n{answer_2}\n<|The End of Assistant B's Answer|>"
+
+def get_chat_response(base64_image, prompt, max_retries=5, wait_time=10):
+    headers = {
+        "Authorization": f"Bearer {API_KEY}",
+        "Content-Type": "application/json",
+    }
+
+    payload = {
+        "model": GPT_EVAL_MODEL_NAME,
+        "messages": [
+            {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": prompt},
+                    {"type": "image_url",
+                     "image_url" : {
+                         "url" : f"data:image/jpeg;base64, {base64_image}"
+                     }
+                    },
+                ],
+            }
+        ],
+        "max_tokens": 1024,
+        "temperature": 0.0,
+    }
+
+    for attempt in range(max_retries):
+        try:
+            response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
+            response.raise_for_status()
+            response_data = response.json()
+            return response_data["choices"][0]["message"]["content"], GPT_EVAL_MODEL_NAME
+        except requests.exceptions.RequestException as e:
+            print(f"Request failed on attempt {attempt+1}: {e}")
+            if attempt == max_retries - 1:
+                print(f"Failed to get response after {max_retries} attempts")
+                return "", GPT_EVAL_MODEL_NAME
+        except Exception as e:
+            print(f"Error on attempt {attempt+1}: {e}")
+            return "", GPT_EVAL_MODEL_NAME
+
+
+
+def image_to_base64(pil_image):
+    buffered = BytesIO()
+    pil_image.save(buffered, format="PNG")
+    return base64.b64encode(buffered.getvalue()).decode("utf-8")
+
+def get_score(judgement, pattern, pairwise=True):
+    matches = pattern.findall(judgement)
+    matches = [m for m in matches if m != ""]
+    if len(set(matches)) == 0:
+        return None, True
+    elif len(set(matches)) == 1:
+        if pairwise:
+            return matches[0].strip("\n"), False
+        return int(matches[0])
+    else:
+        return None, False
+
+def wild_vision_doc_to_visual(doc):
+    return [doc["image"].convert('RGB')]
+
+
+def wild_vision_doc_to_text(doc, model_specific_prompt_kwargs=None):
+    question = doc["instruction"].strip()
+    if "pre_prompt" in model_specific_prompt_kwargs and model_specific_prompt_kwargs["pre_prompt"] != "":
+        question = f"{model_specific_prompt_kwargs['pre_prompt']}{question}"
+    if "post_prompt" in model_specific_prompt_kwargs and model_specific_prompt_kwargs["post_prompt"] != "":
+        question = f"{question}{model_specific_prompt_kwargs['post_prompt']}"
+    return question
+
+def wild_vision_doc_to_target(doc):
+    return doc[BASELINE_MODEL_NAME]
+
+
+def wild_vision_process_results(doc, results):
+    pred = results[0]
+    user_prompt = prompt_template.format(question_1=doc["instruction"], answer_1=doc[BASELINE_MODEL_NAME], answer_2=pred)
+    base64_image = image_to_base64(doc["image"])
+    resps, gpt_name = get_chat_response(base64_image, user_prompt)
+    score, _ = get_score(resps, pattern=re.compile("\[\[([AB<>=]+)\]\]"))
+
+    if score is None:
+        score = resps
+
+    if "A>B" in score:
+        final_score = -1
+        judgement = "Worse" #Baseline better
+    elif "A>>B" in score:
+        final_score = -2
+        judgement = "Worse++"
+    elif "A=B" in score:
+        final_score = 0
+        judgement = "Tie"
+    elif "B>A" in score:
+        final_score = 1
+        judgement = "Better"
+    elif "B>>A" in score:
+        final_score = 2
+        judgement = "Better++"
+    else:
+        final_score = 0
+        judgement = "Unclear"
+
+
+    return {"gpt_eval_score" : {"question" : doc["instruction"], "score" : final_score, "gpt_resps" : resps, "ans_1" : doc[BASELINE_MODEL_NAME], "ans_2" : pred, "filtered_resps" : score, "judgement" : judgement}}
+
+
+def wild_vision_aggregation(results):
+    score = 0
+    for res in results:
+        score += res["score"]
+
+    return score / len(results)
+
+
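In the file above, the judge's verdict is extracted from the free-form GPT response by get_score with the pattern \[\[([AB<>=]+)\]\] and then mapped onto a signed scale, where A is the claude-3-sonnet baseline and B is the evaluated model: A>>B gives -2, A>B gives -1, A=B gives 0, B>A gives +1, B>>A gives +2; wild_vision_aggregation then averages these per-question scores. A simplified, self-contained sketch of that parsing and mapping (the sample judgement strings are made up for illustration):

import re

# Simplified sketch of the verdict parsing and score mapping used above.
VERDICT_PATTERN = re.compile(r"\[\[([AB<>=]+)\]\]")
VERDICT_TO_SCORE = {"A>>B": -2, "A>B": -1, "A=B": 0, "B>A": 1, "B>>A": 2}

def verdict_score(judgement: str) -> int:
    matches = [m for m in VERDICT_PATTERN.findall(judgement) if m != ""]
    if not matches:
        return 0  # the task labels this case "Unclear"
    return VERDICT_TO_SCORE.get(matches[0], 0)

# Made-up judgement strings, only to illustrate the mapping:
samples = [
    "My final verdict is tie: [[A=B]]",
    "Assistant B answered the question more accurately. [[B>A]]",
    "Assistant A is significantly better: [[A>>B]]",
]
scores = [verdict_score(s) for s in samples]
print(scores, sum(scores) / len(scores))  # [0, 1, -2] and their mean, as in wild_vision_aggregation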
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
+task: wildvision_0617
+dataset_name: release_bench_0617_with_modelresponse
+test_split: test500
+output_type: generate_until
+include: _default_template_yaml
+model_specific_prompt_kwargs:
+  default:
+    pre_prompt: ""
+    post_prompt: ""
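This task file selects the test500 split of release_bench_0617_with_modelresponse and leaves both prompt affixes empty, so wild_vision_doc_to_text effectively returns just the stripped instruction. A small sketch of how those defaults flow through the helper defined in the utils above (the doc dict is a made-up stand-in for one dataset row):

# Sketch: effect of the default (empty) pre_prompt/post_prompt from this yaml.
def wild_vision_doc_to_text(doc, model_specific_prompt_kwargs=None):
    question = doc["instruction"].strip()
    if "pre_prompt" in model_specific_prompt_kwargs and model_specific_prompt_kwargs["pre_prompt"] != "":
        question = f"{model_specific_prompt_kwargs['pre_prompt']}{question}"
    if "post_prompt" in model_specific_prompt_kwargs and model_specific_prompt_kwargs["post_prompt"] != "":
        question = f"{question}{model_specific_prompt_kwargs['post_prompt']}"
    return question

fake_doc = {"instruction": "  Describe the chart in one sentence.  "}
print(wild_vision_doc_to_text(fake_doc, {"pre_prompt": "", "post_prompt": ""}))
# -> "Describe the chart in one sentence."  (both affixes are empty, so only .strip() applies)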
