Hi, thanks for your great work. I am reproducing the evaluation results with the latest codebase and the latest LLaVA codebase. The results on the other benchmarks match or show only minor differences. However, the MM-Vet score is very low. Could you please check the MM-Vet evaluation on your side, or let me know what I should be careful of? Thank you!
| Tasks | Version | Filter | n-shot | Metric         | Value  |   | Stderr |
|-------|---------|--------|--------|----------------|--------|---|--------|
| mmvet | Yaml    | none   | 0      | gpt_eval_score  | 1.3761 | ± | N/A    |
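For reference, this is roughly the command I ran to produce the table above (a sketch of my setup; the checkpoint path and flag values are my own choices, so please adjust to your configuration):

```bash
# Reproduction sketch -- checkpoint and flags below are assumptions about my setup,
# not necessarily the exact configuration used for the reported numbers.
python3 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mmvet \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```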