* WIP
* Update GPT evaluation model name and sys prompt
* 🛠️ Scale accuracy to percentage
The accuracy value is now multiplied by 100 in the aggregation function so that it is reported as a percentage. In the evaluation process, the `math` module is imported and progress logging is refactored to log every 100 evaluations instead of every 10, preventing potential logging overflow. NaN handling is added so that 'default_value' is used when split, category, or l2-category data is missing, avoiding errors in those assignments. Finally, categorical and l2-categorical accuracy reporting is consolidated into a new `calculate_hit_rates` function, improving readability and maintainability. An illustrative sketch of these changes appears below the issue refs.
Issue refs: #1427, #1533
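A minimal sketch of the aggregation and NaN handling described above, assuming a hypothetical results list of dicts with `category`, `l2_category`, and boolean `hit` fields; only the `calculate_hit_rates` name comes from the commit, the helper name and field names are illustrative:

```python
import math


def safe_field(doc, key, default_value="default_value"):
    # Hypothetical helper: fall back to a placeholder when the split,
    # category, or l2-category field is missing or NaN, mirroring the
    # NaN handling described above.
    value = doc.get(key)
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return default_value
    return value


def calculate_hit_rates(results):
    # Aggregate overall, per-category, and per-l2-category accuracy,
    # multiplying by 100 so values are reported as percentages.
    category_hits, l2_hits = {}, {}
    for r in results:
        category_hits.setdefault(r["category"], []).append(r["hit"])
        l2_hits.setdefault(r["l2_category"], []).append(r["hit"])
    overall = 100.0 * sum(r["hit"] for r in results) / max(len(results), 1)
    category_acc = {k: 100.0 * sum(v) / len(v) for k, v in category_hits.items()}
    l2_acc = {k: 100.0 * sum(v) / len(v) for k, v in l2_hits.items()}
    return overall, category_acc, l2_acc
```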
* Update GPT evaluation model name and API configuration
* Refactor MMBench_Evaluator class to handle missing columns
* Add print statements for detailed results in MMBench-CN(CC), MMBench-CN(Dev), and MMBench-EN(Dev) evaluations
* Refactor MMBench-CN and MMBench-EN evaluation functions
* 🔄 Refactor result processing and logging logic
- Simplified the result processing functions across the utility modules (`cc_utils.py`, `cn_utils.py`, `en_utils.py`) to unify the handling of multiple-choice options: all options ("A" to "E") are now added dynamically to the result data, defaulting to "nan" when not provided in the document (see the sketch after this list).
- Removed the redundant keys from the process-results dict creation, avoiding clutter and aligning with the new dynamic addition of options.
- In `mmbench_evals.py`, removed the unnecessary check for all splits being 'dev' and streamlined the evaluation loop by eliminating the progress bar (tqdm) for a cleaner log output.
- Removed commented-out code and verbose evaluation logging that may have interfered with performance, for a more efficient and less intrusive logging experience.
This cleanup reduces redundancy in the codebase and improves evaluation performance.
Refs #2045
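As a rough illustration of the unified option handling (the helper name and document field names are assumptions, not taken from the actual `cc_utils.py`/`cn_utils.py`/`en_utils.py` code):

```python
def build_result_entry(doc, prediction):
    # Build one result record; options "A" through "E" are added dynamically
    # and default to the string "nan" when absent from the document.
    entry = {
        "index": doc.get("index"),
        "question": doc.get("question"),
        "answer": doc.get("answer"),
        "prediction": prediction,
    }
    for option in ["A", "B", "C", "D", "E"]:
        entry[option] = doc.get(option, "nan")
    return entry
```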
---------
Co-authored-by: Bo Li <[email protected]>
(cherry picked from commit a19278c)