Multilingual language models seem to be getting better, but how do we know? Language model evaluation in general is made more uncertain by automatic evaluations that correlate poorly with human ratings, by low-quality datasets, and by a lack of reproducibility. But for languages other than high-resource languages such as English and Mandarin Chinese, these problems are even more consequential. We provide a set of best practices for using existing evaluations. Given the limited number of evaluations available for many languages, we highlight languages and tasks that need more benchmarks and outline key considerations for developing new multilingual benchmarks.