Multilingual language models seem to be getting better, but how do we know? Language model evaluation in general is made more uncertain by automatic evaluations that correlate poorly with human ratings, by low-quality datasets, and by a lack of reproducibility. But for languages other than high-resource languages such as English and Mandarin Chinese, these problems are even more consequential. We provide a set of best practices for using existing evaluations. Given the limited number of evaluations available for many languages, we highlight languages and tasks that need more benchmarks and outline key considerations for developing new multilingual benchmarks.