The MaCoCu project: Massive Collection, Curation and Evaluation of Monolingual and Bilingual Data

Rik van Noord and Antonio Toral

We introduce the MaCoCu project: Massive collection and curation of monolingual and bilingual data. This project, funded by the Connecting Europe Facility, is a collaboration between four partners: the University of Groningen (us), Institut Jožef Stefan (Slovenia), the University of Alicante (Spain) and Prompsit Language Engineering (Spain). It aims to build large, high-quality monolingual and parallel corpora for ten under-resourced European languages: Albanian, Bulgarian, Croatian, Icelandic, Macedonian, Maltese, Montenegrin, Serbian, Slovenian and Turkish. Our strategy is to automatically crawl top-level domains, as opposed to existing resources that exploit Common Crawl (e.g. ParaCrawl, OSCAR), with the hypothesis that this leads to higher-quality data.
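The crawling strategy targets national top-level domains rather than Common Crawl dumps. As a minimal illustration of that idea (not the actual MaCoCu crawler; URLs and the helper name are invented), the following sketch keeps only URLs whose host falls under a given country-code TLD:

```python
from urllib.parse import urlparse

def under_tld(url, tld):
    """Return True if the URL's hostname falls under the given
    top-level domain (e.g. 'is' for Iceland, 'mt' for Malta).
    Illustrative helper only; not part of the MaCoCu tooling."""
    host = (urlparse(url).hostname or "").lower()
    return host == tld or host.endswith("." + tld)

# Keep only URLs under the Icelandic TLD.
urls = ["https://www.althingi.is/um-althingi/",
        "https://example.com/page",
        "http://ruv.is/frettir"]
icelandic = [u for u in urls if under_tld(u, "is")]
```

A real crawler would of course also handle robots.txt, deduplication and language identification; the point here is only the TLD-based seed selection.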

Our work mainly focuses on the evaluation of the crawled corpora, and we aim to show that the crawled data is indeed of high quality. First, we train Transformer-based Neural Machine Translation (NMT) systems on the parallel corpora. We compare two experimental settings: training on the MaCoCu data alone, and adding the MaCoCu data to the largest currently available data sets to see whether performance (still) improves. We evaluate across a number of languages, evaluation sets, domains and metrics, and clearly find that the data is indeed of high quality: the best performance is obtained by models that were at least partly trained on the MaCoCu data. In the next few weeks, we will carry out a human evaluation to (hopefully) corroborate these results.
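For the second setting, the MaCoCu parallel data is added to existing data sets. A minimal sketch of what such a combination step could look like (the corpus variables and sentence pairs are invented for illustration, and real pipelines apply far more filtering): concatenate the corpora and drop duplicate sentence pairs, keeping the first occurrence:

```python
def combine_parallel(*corpora):
    """Concatenate corpora of (source, target) sentence pairs,
    dropping exact duplicates and keeping the first occurrence.
    Illustrative only; not the actual MaCoCu pipeline."""
    seen = set()
    combined = []
    for corpus in corpora:
        for src, tgt in corpus:
            pair = (src.strip(), tgt.strip())
            if pair not in seen:
                seen.add(pair)
                combined.append(pair)
    return combined

# Toy English-Icelandic pairs (invented).
existing = [("Hello.", "Halló."), ("Good day.", "Góðan dag.")]
macocu = [("Hello.", "Halló."), ("Thank you.", "Takk.")]
combined = combine_parallel(existing, macocu)  # 3 unique pairs
```

Deduplication matters here because overlap between the new crawl and existing corpora would otherwise inflate the apparent size of the combined training set.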

Second, we train a number of pre-trained language models (LMs) on the monolingual data. For languages that do not already have a strong LM available, we train a RoBERTa-based model from scratch. Otherwise, we simply continue training the strongest available LM for that language (e.g. BERTurk, IceBERT, SloBERT) on the MaCoCu data, so as not to waste resources. For all languages, we compare the performance of these monolingual LMs to continuing training from the multilingual LM XLM-R. We aim to answer the question of whether it is necessary to train a fully monolingual model, or whether fine-tuning a high-quality multilingual LM is preferable. We evaluate performance across a number of tasks and evaluation sets and will release all models publicly. The evaluation for Bulgarian, Icelandic, Maltese, Slovene and Turkish has to be finished before May 30th, so we should have the results during CLIN32. We expect to have improved on the state of the art for all these languages, and that the released models will be very useful for the research community.
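Both training from scratch and continued training use the masked-language-modelling objective of BERT/RoBERTa-style models. As a library-independent sketch of that objective (the 15% masking rate and the 80/10/10 split follow the original BERT recipe; the token sequence and vocabulary below are invented):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, rng=None):
    """BERT-style masking: select ~15% of positions; of those, replace
    80% with [MASK], 10% with a random vocabulary token, and leave 10%
    unchanged. Returns the masked sequence and per-position labels
    (the original token at selected positions, None elsewhere).
    Illustrative sketch, not the actual MaCoCu training code."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)          # model must predict this token
            r = rng.random()
            if r < 0.8:
                masked.append(mask_token)
            elif r < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)      # kept unchanged on purpose
        else:
            labels.append(None)         # position excluded from the loss
            masked.append(tok)

    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, sorted(set(tokens)), rng=random.Random(0))
```

Continued training then simply resumes optimising this objective from an existing checkpoint's weights on the new corpus, which is why it is much cheaper than pretraining from scratch.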