Description
The regular expression used to extract answers for MMLU in common.py fails when the pattern "Answer: LETTER" appears multiple times in the LLM output, which lowers the measured model performance.
Example
The following example demonstrates the issue with a German output. The model correctly selects "C", but the regex extracts "A" as the answer.
Explanation
The regular expression stops at the first occurrence of the "Answer:" prefix, even when the actual answer letter only follows a later occurrence.
Lines 25 to 71 in a8e85cc
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = (
    "(?i){}\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
)
# All the different ways "Answer" is written in different languages
MULTILINGUAL_ANSWER_REGEXES = [
    "Answer\s*:",
    "Answer\s*:",  # Korean invisible character
    "উত্তর\s*:",
    "उत्तर\s*:",
    "উত্তরঃ",
    "উত্তর\s*:",
    "Antwort\s*:",
    "답변\s*:",
    "정답\s*:",
    "답\s*:",
    "答案\s*:",
    "答案\s*:",
    "答\s*:",
    "答\s*:",
    "答复\s*:",
    "答曰\s*:",
    "الإجابة:",
    "الجواب:",
    "إجابة:",
    "الإجابة النهائية:",
    "الإجابة الصحيحة:",
    "الإجابة الصحيحة هي:",
    "الإجابة هي:",
    "Respuesta\s*:",
    "Risposta\s*:",
    "答え\s*:",
    "答え\s*:",
    "回答\s*:",
    "回答\s*:",
    "解答\s*:",
    "Jawaban\s*:",
    "Réponse\s*:",
    "Resposta\s*:",
    "Jibu\s*:",
    "Idahun\s*:",
    "Ìdáhùn\s*:",
    "Idáhùn\s*:",
    "Àmọ̀nà\s*:",
    "Àdáhùn\s*:",
    "Ànúgọ\s*:",
    "Àṣàyàn\s*:",
]
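In the German example above, it extracts the answer "A" from "Antwort:\n\nAntwort: C" because the match starts at the first "Antwort:", the "\s*" after the prefix skips the blank line, and the case-insensitive "[A-D]" then captures the "A" that begins the second "Antwort".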
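The behavior described above can be reproduced with a short script. This is a minimal sketch, not the code from common.py: the template and the German prefix are copied from the excerpt (written as raw strings), while the helper name `extract_first_answer` and the use of `re.search` are assumptions made for illustration.

```python
import re

# Template copied from the excerpt above, written as a raw string so that
# "\s" reaches the regex engine unchanged.
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = (
    r"(?i){}\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
)
# German entry from MULTILINGUAL_ANSWER_REGEXES.
GERMAN_ANSWER_REGEX = r"Antwort\s*:"


def extract_first_answer(text):
    """Hypothetical helper: return the first captured letter, or None."""
    pattern = MULTILINGUAL_ANSWER_PATTERN_TEMPLATE.format(GERMAN_ANSWER_REGEX)
    match = re.search(pattern, text)
    return match.group(1) if match else None


# The model's final answer is "C", but the prefix appears twice.
response = "Antwort:\n\nAntwort: C"
extracted = extract_first_answer(response)
print(repr(extracted))  # prints 'A'
```

The captured "A" is the first letter of the second "Antwort", not an answer choice. With a single occurrence, for example "Antwort: C", the same helper returns "C".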
Impact
This bug significantly impacts the evaluation results for certain languages. In my experiments, German experienced this issue with ~20% of the samples, and Indonesian showed a ~4% impact. Other languages seem less affected.
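One possible mitigation is sketched below. It is an assumption for illustration, not a fix confirmed by the repository: the hypothetical template `SAME_LINE_PATTERN_TEMPLATE` replaces the "\s*" after the prefix with "[ \t]*" so the match cannot cross a line break, and the hypothetical helper `extract_last_answer` returns the last match instead of the first.

```python
import re

# Hypothetical variant of the template: "[ \t]*" instead of "\s*" keeps the
# answer letter on the same line as the prefix. The capture group is copied
# unchanged from the excerpt above.
SAME_LINE_PATTERN_TEMPLATE = (
    r"(?i){}[ \t]*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
)
GERMAN_ANSWER_REGEX = r"Antwort\s*:"


def extract_last_answer(text, answer_regex=GERMAN_ANSWER_REGEX):
    """Hypothetical helper: return the last captured letter, or None."""
    pattern = SAME_LINE_PATTERN_TEMPLATE.format(answer_regex)
    # With a single capture group, findall returns the captured strings.
    letters = re.findall(pattern, text)
    return letters[-1] if letters else None


fixed = extract_last_answer("Antwort:\n\nAntwort: C")
print(repr(fixed))  # prints 'C'
```

Whether the last or the first valid match should win is a design choice; the sketch takes the last one on the assumption that a model states its final answer at the end of its output.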