MMLU answer extraction regex fails with repeated "Answer: LETTER" pattern #33

@lucasresck

Description

The regular expression used to extract answers for MMLU in common.py fails when the pattern "Answer: LETTER" appears multiple times in the LLM output, which distorts the model's measured accuracy.

Example

The following example demonstrates the issue with a German output. The model correctly selects "C", but the regex extracts "A" as the answer.

[Screenshot: German model output in which "Antwort:" appears twice, ending with "Antwort: C"]

Explanation

The regular expression mistakenly only considers the first occurrence of "Answer: LETTER".

simple-evals/common.py

Lines 25 to 71 in a8e85cc

MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = (
"(?i){}\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
)
# All the different ways "Answer" is written in different languages
MULTILINGUAL_ANSWER_REGEXES = [
"Answer\s*:",
"Answer\s*:​​​​​​", # Korean invisible character
"উত্তর\s*:",
"उत्तर\s*:",
"উত্তরঃ",
"উত্তর\s*:",
"Antwort\s*:",
"답변\s*:",
"정답\s*:",
"답\s*:",
"答案\s*:",
"答案\s*:",
"答\s*:",
"答\s*:",
"答复\s*:",
"答曰\s*:",
"الإجابة:",
"الجواب:",
"إجابة:",
"الإجابة النهائية:",
"الإجابة الصحيحة:",
"الإجابة الصحيحة هي:",
"الإجابة هي:",
"Respuesta\s*:",
"Risposta\s*:",
"答え\s*:",
"答え\s*:",
"回答\s*:",
"回答\s*:",
"解答\s*:",
"Jawaban\s*:",
"Réponse\s*:",
"Resposta\s*:",
"Jibu\s*:",
"Idahun\s*:",
"Ìdáhùn\s*:",
"Idáhùn\s*:",
"Àmọ̀nà\s*:",
"Àdáhùn\s*:",
"Ànúgọ\s*:",
"Àṣàyàn\s*:",
]

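In the German example above, it extracts the answer "A" from "Antwort:\n\nAntwort: C": after the first "Antwort:", \s* consumes the line breaks, and the case-insensitive [A-D] then matches the leading "A" of the second "Antwort".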
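
A minimal reproduction of this behavior, assuming the evaluation applies the pattern with re.search (the call site is not shown in this issue). The template is copied from the snippet above, written as a raw string. The stricter variant at the end is only one possible mitigation, not a fix taken from the repository: it rejects a captured letter that is directly followed by another word character and keeps the last match. The names strict_template and strict_extracted are hypothetical.

```python
import re

# Copied from common.py as quoted above (raw string to avoid escape warnings).
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = (
    r"(?i){}\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
)

response = "Antwort:\n\nAntwort: C"
regex = MULTILINGUAL_ANSWER_PATTERN_TEMPLATE.format(r"Antwort\s*:")

# Assumption: the evaluation takes the first match via re.search.
match = re.search(regex, response)
extracted = match.group(1)
print(extracted)  # A  (the first letter of the second "Antwort")

# Hypothetical mitigation: the captured letter must not be followed by a
# word character, and the last match in the output wins.
strict_template = (
    r"(?i){}\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])(?!\w)"
)
strict_regex = strict_template.format(r"Antwort\s*:")
matches = re.findall(strict_regex, response)
strict_extracted = matches[-1] if matches else None
print(strict_extracted)  # C
```

The lookahead is what prevents the false match here; taking the last match is an additional choice for outputs that state several candidate answers before the final one.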

Impact

This bug significantly impacts the evaluation results for certain languages. In my experiments, German experienced this issue with ~20% of the samples, and Indonesian showed a ~4% impact. Other languages seem less affected.
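
One way to estimate how many outputs are affected is sketched below. It is not necessarily how the figures above were obtained: the helper names looks_affected and affected_fraction are hypothetical, and the heuristic (flag an output when the captured letter is directly followed by another alphanumeric character) is an assumption.

```python
import re

# Copied from common.py as quoted above (raw string to avoid escape warnings).
MULTILINGUAL_ANSWER_PATTERN_TEMPLATE = (
    r"(?i){}\s*([A-D]|[أ-د]|[অ]|[ব]|[ড]|[ঢ]|[A]|[B]|[C]|[D])"
)


def looks_affected(response: str, answer_regex: str) -> bool:
    """Heuristic: the captured 'letter' is actually the start of a longer word."""
    regex = MULTILINGUAL_ANSWER_PATTERN_TEMPLATE.format(answer_regex)
    match = re.search(regex, response)
    if match is None:
        return False
    # Character right after the captured letter ("" at end of string).
    following = response[match.end(1):match.end(1) + 1]
    return following.isalnum()


def affected_fraction(responses: list[str], answer_regex: str) -> float:
    """Share of responses flagged by looks_affected (0.0 for an empty list)."""
    if not responses:
        return 0.0
    flagged = sum(looks_affected(r, answer_regex) for r in responses)
    return flagged / len(responses)


# Toy inputs for illustration only, not real model outputs.
samples = ["Antwort:\n\nAntwort: C", "Antwort: B"]
fraction = affected_fraction(samples, r"Antwort\s*:")
print(fraction)  # 0.5
```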
