Ayla Rigouts Terryn, Veronique Hoste and Els Lefever
Based on the D-TERMINE (Data-driven Term Extraction Methodologies Investigated) PhD research (Rigouts Terryn, 2021), an online demo for automatic term extraction is currently being developed. The D-Terminer demo can be found at https://lt3.ugent.be/dterminer/ and supports monolingual term extraction in English, French, Dutch, and German, as well as bilingual automatic term extraction from parallel corpora in pairs of those same languages. The service is open source (https://github.ugent.be/lt3/D-Terminer/) and free of charge, though restrictions apply to the maximum allowed volume of submitted texts.
Monolingual term extraction is based on a supervised method trained on the open-source ACTER dataset. Using the Flair framework (Akbik et al., 2019), a recurrent neural network is trained to tag each token in a domain-specific text, in sequence, as (part of) a term or not. Pre-trained multilingual BERT embeddings (Devlin et al., 2019) are used for this purpose. With the standard settings, a model trained on all three languages (English, French, Dutch) and all four domains (corruption, dressage/equitation, heart failure, wind energy) of the ACTER dataset is applied to extract all terms (Specific Terms, Common Terms, and Out-of-Domain Terms) and Named Entities with IOB (Inside-Outside-Beginning) tagging. Users can customise the settings to use models trained on only a subset of the domains, e.g., when their corpus resembles one of the training corpora. It is also possible to switch from IOB to binary (IO) tagging and to focus on a subset of the term labels. The current version supports the submission of one or more plain text files as a corpus.
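To illustrate the difference between the two tagging schemes, the following is a minimal sketch, not the D-Terminer implementation, of how a tagged token sequence could be decoded into candidate terms. The function names `iob_to_terms` and `iob_to_io` and the bare tags `B`, `I`, and `O` are assumptions made for this example; the actual models may attach term labels to the tags.

```python
from typing import List


def iob_to_terms(tokens: List[str], tags: List[str]) -> List[str]:
    """Group tokens tagged B/I into candidate terms (hypothetical helper).

    "B" opens a new term, "I" continues the open term (or opens one if
    none is open, which is how pure IO sequences are handled), and any
    other tag closes the open term.
    """
    terms: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif tag == "I":
            current.append(token)
        else:
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


def iob_to_io(tags: List[str]) -> List[str]:
    """Collapse IOB tags to binary IO tags (B and I both become I)."""
    return ["I" if tag in ("B", "I") else "O" for tag in tags]


tokens = ["wind", "turbine", "rotor", "blade"]
iob_tags = ["B", "I", "B", "I"]

# IOB keeps two adjacent terms apart; IO merges them into one span.
print(iob_to_terms(tokens, iob_tags))
print(iob_to_terms(tokens, iob_to_io(iob_tags)))
```

In this sketch, the only information lost when moving from IOB to IO is the boundary between directly adjacent terms, which is the trade-off to weigh when choosing the binary scheme.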
For the bilingual automatic term extraction, a bilingual domain-specific corpus can be submitted as a TMX-file. The monolingual texts are derived from this file and monolingual term extraction is performed on each language separately, as described above. Using ASTrED word alignments (Vanroy et al., 2021) and frequency ratio, potentially equivalent candidate terms are aligned cross-lingually.
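The first step described above, deriving monolingual texts from a TMX file, could look roughly like the sketch below. It uses only the Python standard library; the function name `tmx_to_monolingual` and the sample fragment are assumptions for illustration (the fragment is stripped down and omits the header attributes a complete TMX file would carry), and the code is not taken from the D-Terminer code base.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_monolingual(tmx_string: str) -> Dict[str, List[str]]:
    """Collect the segments of each language from a TMX document.

    Returns a mapping from lower-cased language code to the list of
    segments in document order (hypothetical helper).
    """
    root = ET.fromstring(tmx_string)
    texts: Dict[str, List[str]] = defaultdict(list)
    for tu in root.iter("tu"):            # one translation unit per aligned pair
        for tuv in tu.iter("tuv"):        # one variant per language
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang is None or seg is None:
                continue
            # itertext() also keeps text nested inside inline markup.
            texts[lang.lower()].append("".join(seg.itertext()).strip())
    return dict(texts)


SAMPLE = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="en"><seg>Heart failure is common.</seg></tuv>
  <tuv xml:lang="nl"><seg>Hartfalen komt vaak voor.</seg></tuv>
</tu>
</body></tmx>"""

corpus = tmx_to_monolingual(SAMPLE)
print(corpus["en"])
print(corpus["nl"])
```

Because segments are stored in document order per language, the position of a segment in one list still identifies its translation in the other, which a later cross-lingual alignment step could rely on.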
This is an ongoing project; planned improvements range from more advanced term extraction to more customisation and export options.
Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., & Vollgraf, R. (2019). FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, 54–59.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805. http://arxiv.org/abs/1810.04805
Rigouts Terryn, A. (2021). D-TERMINE: Data-driven Term Extraction Methodologies Investigated [Doctoral thesis]. Ghent University.
Vanroy, B., De Clercq, O., Tezcan, A., Daems, J., & Macken, L. (2021). Metrics of Syntactic Equivalence to Assess Translation Difficulty. In M. Carl (Ed.), Explorations in Empirical Translation Process Research (Vol. 3, pp. 259–294). Springer International Publishing. https://doi.org/10.1007/978-3-030-69777-8_10