
Commit 22a4958

Author: Bo Li
[WIP] adding mmbench dev evaluation (#75)
* WIP

* Update GPT evaluation model name and sys prompt

* 🛠️ Scale accuracy to percentage

  The accuracy value is now multiplied by 100 in the aggregation function to represent it as a percentage. In the evaluation process, the `math` module is imported and the refactored loop logs progress every 100 evaluations instead of every 10, preventing potential logging overflow. Handling of NaN values is added so that 'default_value' is used for missing data, avoiding errors in split, category, and l2-category assignments. Finally, reporting of categorical and l2-categorical accuracies is streamlined through a new `calculate_hit_rates` function, improving code readability and maintainability.

  Issue refs: #1427, #1533

* Update GPT evaluation model name and API configuration

* Refactor MMBench_Evaluator class to handle missing columns

* Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations

* Refactor MMBench-CN and MMBench-EN evaluation functions

* 🔄 Refactor result processing and logging logic

  - Simplified the result-processing functions across the utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options. All options ("A" to "E") are now added dynamically to the result data and default to "nan" when not provided in the document.
  - Removed redundant keys from the process-results dict creation to avoid clutter and align with the new dynamic addition of options.
  - In `mmbench_evals.py`, removed the unnecessary check that all splits are 'dev' and streamlined the evaluation loop by dropping the progress bar (tqdm) for cleaner log output.
  - Removed commented-out code and verbose logging during evaluation, which may have hurt performance, for a more efficient and less intrusive logging experience.

  This cleanup reduces redundancy in the codebase and improves evaluation performance.

  Refs #2045

---------

Co-authored-by: Bo Li <[email protected]>

(cherry picked from commit a19278c)
1 parent 70cc773 commit 22a4958
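
The commit message above describes the NaN handling and the calculate_hit_rates helper only in prose. Below is a minimal sketch of what such a helper could look like; the function name comes from the commit message, while the record fields "hit", "category", and "l2_category" are assumptions, and the real code in mmbench_evals.py may differ.

import math


def calculate_hit_rates(records):
    """Overall, per-category, and per-L2-category accuracy for a list of dicts
    with "hit" (0/1), "category", and "l2_category" fields (assumed shape)."""
    overall_acc = sum(r["hit"] for r in records) / max(len(records), 1)

    def grouped_accuracy(key):
        groups = {}
        for r in records:
            value = r.get(key)
            # NaN-safe grouping: missing data falls back to a default bucket,
            # mirroring the 'default_value' handling mentioned in the commit message.
            if value is None or (isinstance(value, float) and math.isnan(value)):
                value = "default_value"
            groups.setdefault(value, []).append(r["hit"])
        return {k: sum(v) / len(v) for k, v in groups.items()}

    return overall_acc, grouped_accuracy("category"), grouped_accuracy("l2_category")

Scaling to a percentage then happens in the aggregation functions (return overall_acc * 100), as the diffs below show.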

File tree

10 files changed: +439 -19 lines


lmms_eval/tasks/mmbench/cc_utils.py

Lines changed: 37 additions & 3 deletions
@@ -9,7 +9,7 @@
 from lmms_eval.tasks.mmbench.mmbench_evals import MMBench_Evaluator
 from lmms_eval.tasks._task_utils.file_utils import generate_submission_file
 
-with open(Path(__file__).parent / "mmbench_cn.yaml", "r") as f:
+with open(Path(__file__).parent / "mmbench.yaml", "r") as f:
     raw_data = f.readlines()
     safe_data = []
     for i, line in enumerate(raw_data):
@@ -19,7 +19,18 @@
 
 config = yaml.safe_load("".join(safe_data))
 
-mmbench_evaluator = MMBench_Evaluator(sys_prompt=config["metadata"]["sys_prompt"])
+GPT_EVAL_MODEL_NAME = config["metadata"]["gpt_eval_model_name"]
+API_TYPE = os.getenv("API_TYPE", "openai")
+
+if API_TYPE == "openai":
+    API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
+    API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
+elif API_TYPE == "azure":
+    API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken")
+    API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY")
+
+
+mmbench_evaluator = MMBench_Evaluator(sys_prompt=config["metadata"]["sys_prompt"], API_KEY=API_KEY, API_URL=API_URL, model_version=GPT_EVAL_MODEL_NAME)
 
 
 def mmbench_doc_to_visual(doc):
@@ -52,21 +63,44 @@ def mmbench_cn_cc_doc_to_text(doc, model_specific_prompt_kwargs=None):
 def mmbench_cn_cc_process_results(doc, results):
     model_response = results[0].strip()
     data = {
+        "gpt_eval_score": {
+            "index": doc["index"],
+            "question": doc["question"],
+            "answer": doc["answer"],
+            "prediction": model_response,
+            "source": doc["source"],
+            "category": doc["category"],
+        },
         "submission": {
             "index": doc["index"],
             "question": doc["question"],
             "answer": doc["answer"],
             "prediction": model_response,
             "source": doc["source"],
             "category": doc["category"],
-        }
+        },
     }
     option_candidate = ["A", "B", "C", "D", "E"]
     for c in option_candidate:
         data["submission"][c] = doc.get(c, "nan")
+        data["gpt_eval_score"][c] = doc.get(c, "nan")
     return data
 
 
+def mmbench_cn_cc_aggregate_dev_results_eval(results, args):
+    print(f"============= MMBench-CN(CC) Detailed Results =============")
+    overall_acc, category_acc, l2_category_acc = mmbench_evaluator.eval_result(results, eval_method="openai")
+    file = generate_submission_file("mmbench_cn_cc_results.json", args)
+    details_info = {
+        "overall_acc": overall_acc,
+        "category_acc": category_acc,
+        "l2_category_acc": l2_category_acc,
+    }
+    with open(file, "w") as f:
+        json.dump(details_info, f)
+    return overall_acc * 100
+
+
 def mmbench_cn_cc_aggregate_results(results, args):
     df = pd.DataFrame(results)
     file = generate_submission_file("mmbench_cn_cc_results.xlsx", args)
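
The block added above reads the judge configuration from environment variables at import time. A usage sketch, assuming the variable names from the diff (the key and URL values below are placeholders only):

import os

# Must be set before cc_utils.py (or cn_utils.py / en_utils.py) is imported,
# since the module builds MMBench_Evaluator at import time.
os.environ["API_TYPE"] = "openai"  # or "azure"
os.environ["OPENAI_API_KEY"] = "sk-placeholder"  # placeholder, not a real key
os.environ["OPENAI_API_URL"] = "https://api.openai.com/v1/chat/completions"
# For Azure instead, set AZURE_API_KEY and AZURE_ENDPOINT, per the elif branch above.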

lmms_eval/tasks/mmbench/cn_utils.py

Lines changed: 33 additions & 3 deletions
@@ -8,8 +8,9 @@
 
 eval_logger = logging.getLogger("lmms-eval")
 from lmms_eval.tasks.mmbench.mmbench_evals import MMBench_Evaluator
+from lmms_eval.tasks._task_utils.file_utils import generate_submission_file
 
-with open(Path(__file__).parent / "mmbench_cn.yaml", "r") as f:
+with open(Path(__file__).parent / "mmbench.yaml", "r") as f:
     raw_data = f.readlines()
     safe_data = []
     for i, line in enumerate(raw_data):
@@ -19,7 +20,18 @@
 
 config = yaml.safe_load("".join(safe_data))
 
-mmbench_evaluator = MMBench_Evaluator(sys_prompt=config["metadata"]["sys_prompt"])
+GPT_EVAL_MODEL_NAME = config["metadata"]["gpt_eval_model_name"]
+API_TYPE = os.getenv("API_TYPE", "openai")
+
+if API_TYPE == "openai":
+    API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
+    API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
+elif API_TYPE == "azure":
+    API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken")
+    API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY")
+
+
+mmbench_evaluator = MMBench_Evaluator(sys_prompt=config["metadata"]["sys_prompt"], API_KEY=API_KEY, API_URL=API_URL, model_version=GPT_EVAL_MODEL_NAME)
 
 
 def mmbench_doc_to_visual(doc):
@@ -55,6 +67,17 @@ def mmbench_doc_to_text(doc, model_specific_prompt_kwargs=None):
 def mmbench_process_results(doc, results):
     model_response = results[0].strip()
     data = {
+        "gpt_eval_score": {
+            "index": doc["index"],
+            "question": doc["question"],
+            "answer": doc["answer"],
+            "prediction": model_response,
+            "hint": doc["hint"],
+            "source": doc["source"],
+            "split": doc["split"],
+            "category": doc["category"],
+            "L2-category": doc["L2-category"],
+        },
         "submission": {
             "index": doc["index"],
             "question": doc["question"],
@@ -65,14 +88,21 @@ def mmbench_process_results(doc, results):
             "split": doc["split"],
             "category": doc["category"],
             "L2-category": doc["L2-category"],
-        }
+        },
     }
     option_candidate = ["A", "B", "C", "D", "E"]
     for c in option_candidate:
         data["submission"][c] = doc.get(c, "nan")
+        data["gpt_eval_score"][c] = doc.get(c, "nan")
     return data
 
 
+def mmbench_aggregate_dev_results_eval(results, args):
+    print(f"============= MMBench-CN(Dev) Detailed Results =============")
+    accuracy = mmbench_evaluator.eval_result(results, eval_method="openai")
+    return accuracy * 100
+
+
 def mmbench_aggregate_dev_results(results, args):
     df = pd.DataFrame(results)
     excel_write_path = generate_submission_file("mmbench_cn_dev_results.xlsx", args)
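
A standalone illustration (not lmms_eval code) of the option-default pattern used by the process_results functions in these diffs: choices "A" through "E" are copied from the document when present and fall back to the string "nan" otherwise.

doc = {"index": 0, "question": "…", "answer": "B", "A": "a cat", "B": "a dog"}  # toy document
row = {}
for c in ["A", "B", "C", "D", "E"]:
    row[c] = doc.get(c, "nan")
print(row)  # {'A': 'a cat', 'B': 'a dog', 'C': 'nan', 'D': 'nan', 'E': 'nan'}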

lmms_eval/tasks/mmbench/en_utils.py

Lines changed: 41 additions & 4 deletions
@@ -9,7 +9,7 @@
 from lmms_eval.tasks.mmbench.mmbench_evals import MMBench_Evaluator
 from lmms_eval.tasks._task_utils.file_utils import generate_submission_file
 
-with open(Path(__file__).parent / "mmbench_en.yaml", "r") as f:
+with open(Path(__file__).parent / "mmbench.yaml", "r") as f:
     raw_data = f.readlines()
     safe_data = []
     for i, line in enumerate(raw_data):
@@ -19,7 +19,18 @@
 
 config = yaml.safe_load("".join(safe_data))
 
-mmbench_evaluator = MMBench_Evaluator(sys_prompt=config["metadata"]["sys_prompt"])
+GPT_EVAL_MODEL_NAME = config["metadata"]["gpt_eval_model_name"]
+API_TYPE = os.getenv("API_TYPE", "openai")
+
+if API_TYPE == "openai":
+    API_URL = os.getenv("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
+    API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
+elif API_TYPE == "azure":
+    API_URL = os.getenv("AZURE_ENDPOINT", "https://api.cognitive.microsoft.com/sts/v1.0/issueToken")
+    API_KEY = os.getenv("AZURE_API_KEY", "YOUR_API_KEY")
+
+
+mmbench_evaluator = MMBench_Evaluator(sys_prompt=config["metadata"]["sys_prompt"], API_KEY=API_KEY, API_URL=API_URL, model_version=GPT_EVAL_MODEL_NAME)
 
 
 def mmbench_doc_to_visual(doc):
@@ -55,6 +66,17 @@ def mmbench_doc_to_text(doc, model_specific_prompt_kwargs=None):
 def mmbench_process_results(doc, results):
     model_response = results[0].strip()
     data = {
+        "gpt_eval_score": {
+            "index": doc["index"],
+            "question": doc["question"],
+            "answer": doc["answer"],
+            "prediction": model_response,
+            "hint": doc["hint"],
+            "source": doc["source"],
+            "split": doc["split"],
+            "category": doc["category"],
+            "L2-category": doc["L2-category"],
+        },
         "submission": {
             "index": doc["index"],
             "question": doc["question"],
@@ -65,15 +87,30 @@ def mmbench_process_results(doc, results):
             "split": doc["split"],
             "category": doc["category"],
             "L2-category": doc["L2-category"],
-        }
+        },
     }
     option_candidate = ["A", "B", "C", "D", "E"]
     for c in option_candidate:
         data["submission"][c] = doc.get(c, "nan")
+        data["gpt_eval_score"][c] = doc.get(c, "nan")
     return data
 
 
-def mmbench_aggregate_dev_results(results, args):
+def mmbench_aggregate_dev_results_eval(results, args):
+    print(f"============= MMBench-EN(Dev) Detailed Results =============")
+    overall_acc, category_acc, l2_category_acc = mmbench_evaluator.eval_result(results, eval_method="openai")
+    file = generate_submission_file("mmbench_en_dev_results.json", args)
+    details_info = {
+        "overall_acc": overall_acc,
+        "category_acc": category_acc,
+        "l2_category_acc": l2_category_acc,
+    }
+    with open(file, "w") as f:
+        json.dump(details_info, f)
+    return overall_acc * 100
+
+
+def mmbench_aggregate_dev_results_submission(results, args):
     df = pd.DataFrame(results)
     excel_write_path = generate_submission_file("mmbench_en_dev_results.xlsx", args)
     with pd.ExcelWriter(excel_write_path) as writer:
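
mmbench_aggregate_dev_results_eval above also writes a JSON summary alongside the other submission artifacts. A sketch of reading it back (the output directory is chosen by generate_submission_file at run time, so the path here is an assumption for illustration):

import json

# Path is illustrative; generate_submission_file decides the real output location.
with open("./logs/mmbench_en_dev_results.json") as f:
    details = json.load(f)
print(details["overall_acc"], details["category_acc"], details["l2_category_acc"])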

lmms_eval/tasks/mmbench/mmbench.yaml

Lines changed: 5 additions & 1 deletion
@@ -4,4 +4,8 @@ task:
 - mmbench_en_test
 - mmbench_cn_dev
 - mmbench_cn_test
-- mmbench_cn_cc
+- mmbench_cn_cc
+metadata:
+  version: 0.0
+  sys_prompt: "There are several options:"
+  gpt_eval_model_name: "gpt-3.5-turbo-0613"

lmms_eval/tasks/mmbench/mmbench_cc.yaml

Lines changed: 4 additions & 2 deletions
@@ -16,12 +16,14 @@ generation_kwargs:
   do_sample: false
 process_results: !function cc_utils.mmbench_cn_cc_process_results
 metric_list:
+  - metric: gpt_eval_score
+    aggregation: !function cc_utils.mmbench_cn_cc_aggregate_dev_results_eval
+    higher_is_better: true
   - metric: submission
     aggregation: !function cc_utils.mmbench_cn_cc_aggregate_results
 metadata:
   version: 0.0
-  gpt_eval_model_name: "gpt-3.5-turbo"
-  quick_extract: true
+  gpt_eval_model_name: "gpt-3.5-turbo-0613"
 
 model_specific_prompt_kwargs:
   default:

lmms_eval/tasks/mmbench/mmbench_cn.yaml

Lines changed: 2 additions & 3 deletions
@@ -5,6 +5,5 @@ task:
 - mmbench_cn_cc
 metadata:
   version: 0.0
-  gpt_eval_model_name: "gpt-3.5-turbo"
-  quick_extract: true
-  sys_prompt: "有如下几个选项:"
+  gpt_eval_model_name: "gpt-3.5-turbo-0613"
+  sys_prompt: "有如下几个选项:"

lmms_eval/tasks/mmbench/mmbench_cn_dev.yaml

Lines changed: 3 additions & 0 deletions
@@ -1,6 +1,9 @@
 task: "mmbench_cn_dev"
 test_split: "dev"
 metric_list:
+  - metric: gpt_eval_score
+    aggregation: !function cn_utils.mmbench_aggregate_dev_results_eval
+    higher_is_better: true
   - metric: submission
     higher_is_better: true
     aggregation: !function cn_utils.mmbench_aggregate_dev_results

lmms_eval/tasks/mmbench/mmbench_en.yaml

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ task:
 metadata:
   version: 0.0
   sys_prompt: "There are several options:"
+  gpt_eval_model_name: "gpt-3.5-turbo-0613"

lmms_eval/tasks/mmbench/mmbench_en_dev.yaml

Lines changed: 5 additions & 2 deletions
@@ -2,6 +2,9 @@ task: "mmbench_en_dev"
 test_split: dev
 include: _default_template_mmbench_en_yaml
 metric_list:
-  - metric: submission
-    aggregation: !function en_utils.mmbench_aggregate_dev_results
+  - metric: gpt_eval_score
+    aggregation: !function en_utils.mmbench_aggregate_dev_results_eval
     higher_is_better: true
+  - metric: submission
+    aggregation: !function en_utils.mmbench_aggregate_dev_results_submission
+    higher_is_better: true
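
The !function entries added in these YAML diffs point at callables in the task folder's utility modules (e.g. en_utils.mmbench_aggregate_dev_results_eval). As a rough illustration only, not lmms_eval's actual config loader, such a spec can be resolved to a callable like this:

import importlib


def resolve_function(spec: str, package: str = "lmms_eval.tasks.mmbench"):
    """Turn "module.func" into the callable package.module.func (illustrative only)."""
    module_name, func_name = spec.rsplit(".", 1)
    module = importlib.import_module(f"{package}.{module_name}")
    return getattr(module, func_name)


# Example (requires lmms_eval to be installed):
# agg = resolve_function("en_utils.mmbench_aggregate_dev_results_eval")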
