Ahmet Üstün, Arianna Bisazza, Gosse Bouma and Gertjan van Noord
Recent advances in massively multilingual pre-training, such as multilingual BERT (Devlin et al. 2019) and XLM-RoBERTa (Conneau et al. 2020), allow building high-quality multilingual parsers by simply fine-tuning large pre-trained language models on the concatenation of datasets from multiple languages. This considerably improves low-resource and zero-shot generalization, but is limited by the “transfer–interference trade-off”, which degrades performance on high-resource languages.
In previous work, we proposed UDapter (Üstün et al. 2020), a multilingual dependency parser that strikes a better balance between maximum sharing and language-specific capacity. Rather than learning language-specific parameters directly, UDapter generates its adapter modules as a function of learned language embeddings. Moreover, these language embeddings are projected from linguistically curated typological features instead of being learned from scratch. As a result, UDapter also works for languages without any task-specific training data, i.e. the zero-shot setting: after fine-tuning, the model can project any vector of typological features to a language embedding used for adapter generation.
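The adapter-generation idea above can be sketched as a small hypernetwork: a linear map from a language embedding to the flattened parameters of a bottleneck adapter. This is a minimal illustration, not the authors' implementation; the dimensions, the single-layer hypernetwork, and the ReLU bottleneck are simplifying assumptions.

```python
import numpy as np

HIDDEN, BOTTLENECK, LANG_DIM = 768, 64, 32  # illustrative sizes (assumption)
rng = np.random.default_rng(0)

# Hypernetwork: one linear map from the language embedding to the
# flattened adapter parameters (down- and up-projection matrices).
W_hyper = rng.normal(scale=0.01, size=(2 * HIDDEN * BOTTLENECK, LANG_DIM))

def generate_adapter(lang_emb):
    """Generate language-specific adapter weights from a language embedding."""
    flat = W_hyper @ lang_emb
    w_down = flat[: HIDDEN * BOTTLENECK].reshape(BOTTLENECK, HIDDEN)
    w_up = flat[HIDDEN * BOTTLENECK :].reshape(HIDDEN, BOTTLENECK)
    return w_down, w_up

def apply_adapter(hidden_states, lang_emb):
    """Apply the generated bottleneck adapter with a residual connection."""
    w_down, w_up = generate_adapter(lang_emb)
    z = np.maximum(hidden_states @ w_down.T, 0.0)  # ReLU bottleneck
    return hidden_states + z @ w_up.T              # residual connection
```

Because the adapter weights are a function of the language embedding rather than free parameters, any language that can be mapped to an embedding, including an unseen one, immediately receives its own adapter.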
However, typological features are not always available, especially for low-resource languages. For instance, out of 289 syntactic and phonological typological features, Kazakh, Belarusian, and Tamil have gold annotations for only 14, 19, and 224 features respectively. To address this problem, we modify UDapter so that it learns to predict missing features jointly with dependency parsing. Specifically, we mask a random subset of typological features during training and predict them together with the target word labels in a multi-task manner. At inference time, the model can then project incomplete typological feature vectors to language embeddings, eliminating the need for an external typology prediction system.
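The masking step can be sketched as follows. This is a hypothetical illustration, not the authors' code: the masking rate, the sentinel value for unknown features, and the binary feature encoding are all assumptions made for the example.

```python
import numpy as np

N_FEATURES = 289    # typological features per language (as in the text)
MASK_PROB = 0.2     # illustrative masking rate (assumption)
MASK_VALUE = -1.0   # sentinel marking a feature as unknown (assumption)
rng = np.random.default_rng(0)

def mask_features(features, mask_prob=MASK_PROB):
    """Randomly mask a subset of typological features for training.

    Returns the masked vector (fed to the model as input) and a boolean
    mask marking which positions must be reconstructed as an auxiliary
    prediction task alongside dependency parsing.
    """
    mask = rng.random(features.shape) < mask_prob
    masked = features.copy()
    masked[mask] = MASK_VALUE
    return masked, mask
```

During training, the model receives `masked` and is supervised to recover the hidden values at the masked positions in addition to the parsing objective; at test time, genuinely missing features are simply encoded with the same sentinel, so no external typology predictor is needed.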
Our experiments show that UDapter enhanced with joint prediction achieves performance highly competitive with the original parser, while not relying on externally predicted features. We further analyze, for individual languages, the correlation between the number of missing features and the difference in parsing accuracy. Finally, we evaluate the feature prediction accuracy of the joint model, confirming that the relevant typological information can indeed be learned directly from task-specific training data.