Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics

Conference Paper · License: CC BY-NC-SA 4.0 · Python 3.10+ · Code style: black

This repository provides a research toolkit for the interpretable meta-evaluation of Machine Translation (MT) metrics. It offers commands for evaluating metrics' capabilities on two tasks:

  • Data Filtering: Separate good-quality translations from poor-quality ones.
  • Translation Re‑Ranking: Identify the best translation in a pool of translations of the same source text.

Setup

pip install -e .

Data Filtering

Evaluate metrics' data-filtering capabilities with the rank_metrics.py script, passing data-filtering as the --task parameter.

For example, the following command runs the evaluation on the WMT23 test set for the Chinese-to-English translation direction:

# Use --gold-score-threshold -1 for PERFECT vs OTHER, or -4 for GOOD vs BAD
python scripts/py/rank_metrics.py \
    --testset-names wmt23 \
    --lps zh-en \
    --refs-to-use refA \
    --task data-filtering \
    --average-by sys \
    --include-human \
    --include-outliers \
    --gold-name mqm \
    --gold-score-threshold -1

For each metric, this command runs an optimization process to find the best score threshold for separating GOOD from BAD translations, or PERFECT from OTHER translations, depending on the --gold-score-threshold parameter. To speed it up, pass the --n-processes argument to set the number of processes to run in parallel. If left unset, it defaults to the number of processors on your machine.
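
As a rough illustration of this optimization (a minimal sketch, not the repository's actual implementation; the function names and the brute-force sweep are assumptions), the snippet below tries every observed metric score as a candidate threshold and keeps the one that maximizes F1 against the gold binarization induced by --gold-score-threshold:

# Minimal sketch of the threshold search described above, assuming
# per-segment metric scores and gold (e.g., MQM) scores as plain lists.
# Function and variable names are illustrative, not the repository's API.
from typing import Sequence, Tuple


def f1_at_threshold(
    metric_scores: Sequence[float],
    gold_labels: Sequence[bool],
    threshold: float,
) -> float:
    """F1 of the positive (GOOD/PERFECT) class when the metric accepts
    every translation whose score is >= threshold."""
    tp = fp = fn = 0
    for score, is_good in zip(metric_scores, gold_labels):
        predicted_good = score >= threshold
        if predicted_good and is_good:
            tp += 1
        elif predicted_good and not is_good:
            fp += 1
        elif not predicted_good and is_good:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def find_best_threshold(
    metric_scores: Sequence[float],
    gold_scores: Sequence[float],
    gold_score_threshold: float = -1.0,
) -> Tuple[float, float]:
    """Sweep every observed metric score as a candidate threshold and keep
    the one maximizing F1 against the gold binarization
    (gold_score >= gold_score_threshold -> positive class)."""
    gold_labels = [g >= gold_score_threshold for g in gold_scores]
    best_threshold, best_f1 = 0.0, -1.0
    for candidate in sorted(set(metric_scores)):
        f1 = f1_at_threshold(metric_scores, gold_labels, candidate)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


if __name__ == "__main__":
    metric = [0.91, 0.35, 0.78, 0.12, 0.66]
    mqm = [0.0, -8.0, -0.5, -12.0, -3.0]  # MQM: 0 is perfect, lower is worse
    threshold, f1 = find_best_threshold(metric, mqm, gold_score_threshold=-1.0)
    print(f"best threshold={threshold:.2f}, F1={f1:.3f}")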

This command also works with pre-computed metric thresholds. To use them, pass the --thresholds-from-json argument; the script then only computes Precision, Recall, and F1 scores, skipping the slower optimization process.
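
A hypothetical example of such a thresholds file, assuming a simple metric-name-to-threshold mapping (the actual schema expected by --thresholds-from-json may differ; check rank_metrics.py for the exact format):

# Hypothetical pre-computed thresholds file: a metric-name -> threshold
# mapping written as JSON. Metric names and values are illustrative only.
import json

thresholds = {
    "COMET-22": 0.78,
    "BLEU": 32.5,
}

with open("thresholds.json", "w") as f:
    json.dump(thresholds, f, indent=2)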

Translation Re-Ranking

Evaluate metrics' translation re-ranking capabilities with the rank_metrics.py script, passing translation-reranking as the --task parameter.

For example, the following command runs the evaluation on the WMT23 test set for the Chinese-to-English translation direction:

python scripts/py/rank_metrics.py \
    --testset-names wmt23 \
    --lps zh-en \
    --refs-to-use refA \
    --task translation-reranking \
    --include-human \
    --include-outliers \
    --gold-name mqm

For each metric, this command measures its re-ranking precision, i.e., the proportion of times it identifies the best translation in a pool of translations of the same source.
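
A minimal sketch of this computation (illustrative names and data layout, not the repository's API), assuming per-candidate metric and gold scores grouped by source text:

# For each source-text pool, check whether the metric's top-scored
# translation is also the best one according to the gold (e.g., MQM) scores.
from typing import Dict, List, Tuple


def reranking_precision(pools: Dict[str, List[Tuple[float, float]]]) -> float:
    """pools maps each source id to a list of (metric_score, gold_score)
    pairs, one pair per candidate translation of that source."""
    hits = 0
    for candidates in pools.values():
        metric_best = max(range(len(candidates)), key=lambda i: candidates[i][0])
        gold_best = max(gold for _, gold in candidates)
        # Count a hit if the metric's pick is (one of) the gold-best translations.
        if candidates[metric_best][1] == gold_best:
            hits += 1
    return hits / len(pools)


if __name__ == "__main__":
    pools = {
        "src-1": [(0.91, 0.0), (0.85, -2.0), (0.40, -9.0)],   # metric picks the gold-best
        "src-2": [(0.70, -5.0), (0.88, -7.0), (0.60, -1.0)],  # metric picks a worse one
    }
    print(f"re-ranking precision: {reranking_precision(pools):.2f}")  # 0.50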

Cite us

If you find our paper or code useful, please reference this work:

@inproceedings{perrella-etal-2024-beyond,
    title = "Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics",
    author = "Perrella, Stefano  and
      Proietti, Lorenzo  and
      Huguet Cabot, Pere-Llu{\'i}s  and
      Barba, Edoardo  and
      Navigli, Roberto",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.emnlp-main.1152/",
    doi = "10.18653/v1/2024.emnlp-main.1152",
    pages = "20689--20714",
}

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
