Not all zero-shot settings are created equal in multilingual NLP

Miryam de Lhoneux

The field of multilingual NLP is booming. This is due in no small part to large multilingual pretrained language models such as mBERT and XLM-R, which have been found to have surprising cross-lingual transfer capabilities. A number of papers have reported large improvements on multiple tasks in a “zero-shot” scenario, where “zero-shot” usually means that no task data has been used for a given target language (for example, mBERT is fine-tuned for POS tagging using English data and then applied directly to French). We highlight that the term zero-shot is used uniformly to describe situations where the model has seen widely different amounts of data relevant to the target language, creating very different flavors of what it means for a scenario to be “zero-shot”. For example, an mBERT model fine-tuned on dependency parsing using a sample of treebanks that includes a Russian treebank performs very well on Belarusian test data, but poorly on Yoruba and extremely poorly on Amharic. This is unsurprising: no data from any language related to Yoruba was used for fine-tuning, and the script of Amharic is entirely unseen by the model.
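To make the setup concrete, the sketch below illustrates the zero-shot scenario from the POS tagging example above: mBERT is fine-tuned on English data only and then evaluated directly on a French test set that contributed no task data. This is a minimal sketch rather than the exact experimental code; the Universal Dependencies treebank identifiers (en_ewt, fr_gsd) and the training hyperparameters are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Task data: English only. The French treebank is used purely for evaluation,
# so French is "zero-shot" with respect to the POS tagging task.
train_en = load_dataset("universal_dependencies", "en_ewt", split="train")
test_fr = load_dataset("universal_dependencies", "fr_gsd", split="test")
num_tags = train_en.features["upos"].feature.num_classes

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=num_tags)

def encode(batch):
    # Align word-level UPOS tags with subword tokens: label the first subword
    # of each word and mask the rest with -100 so the loss ignores them.
    enc = tokenizer(batch["tokens"], is_split_into_words=True,
                    truncation=True, padding="max_length", max_length=128)
    labels = []
    for i, tags in enumerate(batch["upos"]):
        prev, row = None, []
        for wid in enc.word_ids(batch_index=i):
            row.append(tags[wid] if wid is not None and wid != prev else -100)
            prev = wid
        labels.append(row)
    enc["labels"] = labels
    return enc

train_en = train_en.map(encode, batched=True)
test_fr = test_fr.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert-pos-en", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_en,
)
trainer.train()                    # fine-tune on English task data
print(trainer.evaluate(test_fr))  # evaluate zero-shot on French
```

The same code applied to Belarusian, Yoruba, or Amharic test sets would all count as “zero-shot” under current usage, even though the model's pretraining exposure to those languages (and their scripts) differs drastically.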
We discuss the advantages and limitations of existing experimental designs but, more importantly, call for careful consideration of language properties in zero-shot experiments in multilingual NLP.