The challenges of creating a corpus of minority languages and its dialects in Natural Language Processing: the case of the South American indigenous language Guarani

Yliana Rodríguez, Luis Chiruzzo and Santiago Góngora

In spite of their diversity, the Indigenous languages of the American continent have received little attention from a technological perspective (Mager, Gutierrez-Vasques, Sierra & Meza-Ruiz, 2018). Guarani is one of the most widely spoken native South American languages; hence, one could expect it to be well represented online. However, its presence on the web is scarce, even on Paraguayan websites, where Spanish is the predominant language. This phenomenon has been observed before in attempts to build corpora for minority languages in other multilingual contexts; for example, Jauhiainen et al. (2020) present a similar argument when building a web-centered corpus for Uralic minority languages. Moreover, the long-standing discussion over the boundaries between dialects and languages is another challenge faced in Natural Language Processing (NLP). Even though Guarani is not a minority language in terms of its number of speakers, it is under-resourced (Krauwer, 2003) and under-researched from a computational linguistics perspective. Together with Spanish, Guarani is an official language of Paraguay, and it is also widely spoken by the country's non-indigenous population (Estigarribia, 2015). Its co-existence with Spanish has resulted in the emergence of new varieties and language mixing, which can be traced back to colonial times in the Jesuits' notes, e.g. Dobrizhoffer (1783). Guarani has only recently adopted a single, stable orthography and has a limited online presence, among other characteristics that make it difficult to work with from a computational viewpoint (lack of digital resources for language processing, bilingual electronic dictionaries, transcribed speech data, etc.). Although there have been efforts towards compiling monolingual resources for Guarani (Aguero-Torales et al., 2021; Secretaría de Políticas Lingüísticas del Paraguay, 2019; Rios et al., 2014), machine translation for the Guarani-Spanish pair and the development of parallel data remain largely under-explored topics.
The challenges presented here arose in the process of building a parallel corpus of Guarani-Spanish text aligned at the sentence level. The corpus contains about 30,000 sentence pairs and is structured as a collection of subsets from different sources, further split into training, development and test sets. A sample of sentences from the test set was manually annotated by native speakers in order to incorporate meta-linguistic annotations about the Guarani dialects present in the corpus, as well as about the correctness of the alignment and translation. Some baseline Machine Translation (MT) experiments were carried out, with the intention that the corpus be used as a benchmark for testing Guarani-Spanish MT systems and that it be expanded and improved in quality in future iterations. Guarani and Spanish have been in contact for centuries, giving rise to several contact varieties. We outline the challenges faced while building the corpus, which stem from the scarcity of bilingual literature and the format in which it is available, and we discuss the distinction between the many varieties of Guarani spoken in Paraguay and the mixing of Guarani with Spanish. The interdisciplinary spirit of this project, which joins forces from engineering and linguistics, is also a novelty for the field, especially in South American academia.