Wietse de Vries, Martijn Bartelds, Malvina Nissim and Martijn Wieling
Large pre-trained language models are trained on massive amounts of data, which exist for high-resource languages but are not available for the majority of the world’s languages. Multilingual BERT (mBERT; Devlin et al., 2019) has been found to generalize across languages with high zero-shot transfer performance on a variety of tasks (Pires et al., 2019; Wu and Dredze, 2019), but low-resource languages not included in mBERT pre-training usually show poor performance (Nozza et al., 2020; Wu and Dredze, 2020).
An alternative to multilingual transfer learning is the adaptation of existing monolingual models to other languages. We focus on two regional language varieties of the north of the Netherlands, namely Gronings (a Low Saxon variety) and West Frisian. For these varieties, we investigate how language similarity influences the choice of the foundational monolingual model for zero-shot transfer learning, using as little data as possible, on the task of POS tagging (for which we have a small collection of annotated data for the target languages).
We use three monolingual BERT models of West Germanic languages closely related to our target languages (English, German and Dutch), as well as mBERT. We quantify similarity between each source language and the target languages using the lexical-phonetic LDND measure (Wichmann et al., 2010). A syntax-based measure might be preferable, but none is available for our language varieties. On the basis of LDND, we expect Gronings and West Frisian to profit most from Dutch and least from English, with German in between.
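To make the distance measure concrete, the following is a minimal sketch of LDND as we understand it from Wichmann et al. (2010): the Levenshtein distance between two words is normalized by the length of the longer word (LDN), and the mean LDN over word pairs with the same meaning is divided by the mean LDN over word pairs with different meanings. The function names and the aligned-list input format are assumptions made for illustration, and this is not the code used to obtain the distances reported here.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """Levenshtein distance normalized by the longer word (non-empty input)."""
    return levenshtein(a, b) / max(len(a), len(b))


def ldnd(words_a: List[str], words_b: List[str]) -> float:
    """LDND for two aligned word lists.

    Assumption: words_a[i] and words_b[i] express the same concept, and the
    lists contain at least two concepts whose words are not all identical.
    """
    if len(words_a) != len(words_b) or len(words_a) < 2:
        raise ValueError("need two aligned lists with at least two concepts")
    n = len(words_a)
    same = [ldn(words_a[i], words_b[i]) for i in range(n)]
    diff = [ldn(words_a[i], words_b[j])
            for i in range(n) for j in range(n) if i != j]
    return (sum(same) / len(same)) / (sum(diff) / len(diff))
```

Dividing by the different-meaning baseline is intended to correct for chance similarity, for example when two languages share much of their sound inventory. In practice the inputs would be phonetic transcriptions of a fixed concept list rather than orthographic words.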
Our training procedure consists of two independent steps. First, the Transformer layers in the monolingual models and mBERT are fine-tuned for POS tagging on Universal Dependencies POS-annotated treebanks in the model’s source language. Second, new lexical layers for each BERT model are trained on unlabeled target language data with a masked language modeling pre-training objective. Afterwards, the retrained lexical layer and the fine-tuned Transformer layers are combined to yield a POS-tagging model that is adapted to the target language.
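The combination step can be sketched as a merge of two parameter dictionaries. The sketch below is illustrative only: it assumes that the lexical layer corresponds to the token embedding matrix, that parameters are stored in name-to-value dictionaries, and that lexical parameters share a common name prefix. The prefix `embeddings.word_embeddings.` and the function name `combine_layers` are hypothetical, and the released code may organize this differently.

```python
from typing import Any, Dict

# Hypothetical name prefix of the lexical (token embedding) parameters.
LEXICAL_PREFIX = "embeddings.word_embeddings."


def combine_layers(
    task_model: Dict[str, Any],
    target_lm: Dict[str, Any],
    lexical_prefix: str = LEXICAL_PREFIX,
) -> Dict[str, Any]:
    """Merge a source-language tagger with a target-language lexical layer.

    task_model: parameters fine-tuned for POS tagging on the source language.
    target_lm:  parameters whose lexical layer was retrained on unlabeled
                target-language text with masked language modeling.
    """
    combined: Dict[str, Any] = {}
    # Keep everything except the lexical layer from the fine-tuned tagger
    # (Transformer layers and the POS classification head).
    for name, value in task_model.items():
        if not name.startswith(lexical_prefix):
            combined[name] = value
    # Take only the lexical layer from the target-language model; its
    # masked language modeling head is not needed for tagging.
    for name, value in target_lm.items():
        if name.startswith(lexical_prefix):
            combined[name] = value
    return combined
```

Because the two steps are independent, one fine-tuned tagger per source language could in principle be paired with a retrained lexical layer for each target language without repeating the fine-tuning.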
The monolingual language models with the original lexical layers perform poorly on Gronings and West Frisian. mBERT with its original lexical layer achieves better results, but only West Frisian performance is comparable to the source language performance. West Frisian was included in mBERT pre-training, suggesting that mBERT might serve languages included in pre-training well, but may be less suitable for those not included, like Gronings. For all monolingual models, task performance greatly improves when the lexical layer is retrained for Gronings and West Frisian. The best results are obtained by (Dutch) BERTje fine-tuned on the Alpino dataset (92.4% for Gronings, 95.4% for West Frisian). (English) BERT yields the worst performance. Performance scores and the linguistic distance from Gronings and West Frisian to the source languages are strongly negatively correlated (r = −0.85, p < 0.05), suggesting that measures of linguistic distance can guide the choice of monolingual model. With high language similarity, we also observe that 10 MB of unlabeled data seems sufficient to achieve substantial monolingual transfer performance.
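For readers who want to run the same kind of check on their own models, a correlation between linguistic distance and task performance can be computed as in the sketch below. The function name `pearson_r` is our own and the sketch assumes the Pearson correlation coefficient. No values from this work are reproduced: the inputs are one distance and one performance score per source-target language pair, and a significance test would be needed in addition.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

A negative coefficient indicates that performance drops as the distance between the source and target language grows.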
Our pre-trained models for Gronings and West Frisian are released. Code is available at: https://github.com/wietsedv/low-resource-adapt