Add Winogrande evaluation #5015
Merged
It is not the most efficient implementation, but a) I wanted to have something that looks like it is working first, and b) evaluation time is not that long (70 seconds for the 1267 tasks of the Winogrande evaluation dataset with Mistral-7B using CUDA on an RTX 4080), so performance improvements are not as important as they are for HellaSwag.
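For reference, the selection rule this kind of two-choice evaluation uses can be sketched as follows. This is a simplified Python illustration, not the actual C++ code in this PR; `avg_logprob`, `pick_winogrande_option`, and the token log-probability lists are hypothetical stand-ins for what the model would produce:

```python
def avg_logprob(token_logprobs):
    """Average per-token log-likelihood of one candidate sentence."""
    return sum(token_logprobs) / len(token_logprobs)

def pick_winogrande_option(logprobs_a, logprobs_b):
    """Return 0 if option A scores higher, else 1.

    Each argument is the list of per-token log-probabilities the model
    assigns to the full sentence with that option substituted in.
    """
    return 0 if avg_logprob(logprobs_a) >= avg_logprob(logprobs_b) else 1

# Toy example: option A is, on average, more likely per token.
print(pick_winogrande_option([-1.2, -0.8, -1.0], [-2.0, -1.5, -1.9]))  # prints 0
```

Which tokens go into the average is exactly the knob discussed below: including or excluding parts of the shared context changes the averages, and therefore sometimes the chosen option.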
I'm not quite getting the scores reported on the HF Leaderboard (HFLB). For the Winogrande evaluation dataset (see https://huggingface.co/datasets/ikawrakow/winogrande-eval-for-llama.cpp), which contains 1267 tasks, I get 73.56 vs. 78.37 reported on HFLB for Mistral-7B. Statistical uncertainty (1-sigma) is 1.24, so there is a tiny chance that this could simply be statistics. On the other hand, we also get lower HellaSwag scores compared to HFLB, so this is kind of expected.

Interestingly enough, the Winogrande score varies quite a bit depending on which parts of the context are included when computing the average log-likelihood, so perhaps I haven't found the right magic subset of tokens that maximizes the score.
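As a sanity check on the quoted uncertainty: for a binary-choice benchmark, the 1-sigma error of an accuracy `p` over `n` tasks is `sqrt(p * (1 - p) / n)`. A quick back-of-the-envelope check (not code from the PR) with the numbers above:

```python
import math

def sigma_pct(acc_pct, n):
    """1-sigma uncertainty (in percent) of a binomial accuracy estimate."""
    p = acc_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

s = sigma_pct(73.56, 1267)
print(f"sigma = {s:.2f}")                                # 1.24, matching the quoted value
print(f"gap = {(78.37 - 73.56) / s:.1f} sigma vs HFLB")  # ~3.9 sigma
```

At roughly 3.9 sigma, the gap to the leaderboard value is unlikely to be pure statistics, which is consistent with the similar HellaSwag discrepancy.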
Usage:

If `--winogrande-tasks` is omitted, all tasks in the dataset will be evaluated.

Update:
I ran Winogrande on the extra-large Winogrande training dataset (40397 tasks) with Mistral-7B. I get 83.79 +/- 0.18, which is significantly higher than the HFLB value. Getting a higher value is expected, as this is training data and models have most likely been trained on it, but it still gives confidence that the implementation is correct.