Domain-Specific Alignment of Embeddings to improve Zero/Low-Shot Performance on Biomedical NLP Tasks

Pranaydeep Singh, Ayla Rigouts Terryn and Els Lefever

Alignment of word embeddings in different languages into a common space has been a very active
research domain over the last few years and has set new benchmarks in the field of zero-shot cross-
lingual evaluation. Methods like MUSE (Conneau et al. 2017) and VecMap (Artetxe et al. 2018)
have been at the forefront of this development. These methods leverage the inherent structural
similarities of embedding spaces across languages to map them into a common space through iterative refinement. These
alignment methods can be applied in both supervised and unsupervised settings: the supervised
setting relies on a bilingual dictionary, while the unsupervised methods construct
a seed dictionary from scratch using techniques such as adversarial learning and cross-domain
similarity local scaling (CSLS). However, most of this research has been performed on very general corpora,
and the methods fail when applied to NLP tasks on specialized corpora. In this work, we focus
on improving alignments for specific domains, the biomedical domain in particular, aiming to
improve performance on downstream tasks in this domain. We evaluate two experimental setups,
viz. low-supervision and zero-supervision, for two language pairs, namely English-Dutch and English-
French. For the zero-supervision setting, we start from a seed dictionary generated from common
tokens such as numerals, complemented with token-similarity heuristics based on edit distance, stemming,
etc. For the low-supervision setting, we use aligned tokens in English-Dutch and English-French
from the ACTER dataset (Rigouts Terryn et al. 2020) to compile a seed dictionary. We compare
both alignment approaches when incorporating a general and a domain-specific seed dictionary and
evaluate the performance for bilingual lexicon induction and named entity recognition.
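To illustrate the zero-supervision setup, a seed dictionary can be bootstrapped from tokens shared across the two vocabularies (numerals) plus near-identical tokens under an edit-distance heuristic. The sketch below is illustrative only: the thresholds, helper names, and toy vocabularies are assumptions, not the authors' actual implementation.

```python
# Sketch: bootstrap a bilingual seed dictionary from shared numerals and
# near-identical tokens (edit-distance heuristic). Thresholds are illustrative.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def build_seed_dictionary(src_vocab, tgt_vocab, max_dist=2, min_len=4):
    """Pair numerals occurring in both vocabularies, plus longer tokens whose
    edit distance is at most max_dist (e.g. 'infection' / 'infectie')."""
    tgt_set = set(tgt_vocab)
    pairs = [(w, w) for w in src_vocab if w.isdigit() and w in tgt_set]
    for s in src_vocab:
        if s.isdigit() or len(s) < min_len:
            continue
        for t in tgt_vocab:
            if abs(len(s) - len(t)) <= max_dist and edit_distance(s, t) <= max_dist:
                pairs.append((s, t))
    return pairs

# Toy English-Dutch vocabularies (hypothetical examples).
src = ["2020", "infection", "patient", "the"]
tgt = ["2020", "infectie", "patiënt", "de"]
print(build_seed_dictionary(src, tgt))
```

Note that a length filter such as `min_len` is needed because short, frequent words ("the" / "de") are close in edit distance without being translations.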
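Once a seed dictionary is available, both MUSE and VecMap refine the mapping with an orthogonal Procrustes step: find the rotation that best maps seed source embeddings onto their target counterparts. The following minimal sketch on synthetic data shows this step; the matrix shapes and helper name are assumptions for illustration.

```python
# Sketch: the supervised alignment step (orthogonal Procrustes). X and Y hold
# source/target embeddings for the seed-dictionary pairs, one pair per row;
# all data here is synthetic.
import numpy as np

def procrustes_align(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return the orthogonal map W minimising ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
d = 5
# Build a toy target space as a random rotation of the source space.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix
X = rng.normal(size=(20, d))                   # "source" embeddings
Y = X @ Q                                      # "target" embeddings
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))                   # the rotation is recovered
```

Constraining W to be orthogonal preserves distances and angles in the source space, which is why this closed-form SVD solution is preferred over an unconstrained least-squares fit in the alignment literature.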


Artetxe, Mikel, Gorka Labaka, and Eneko Agirre (2018), A robust self-learning method for fully unsupervised
cross-lingual mappings of word embeddings, Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pp. 789–798.
Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou
(2017), Word translation without parallel data, arXiv preprint arXiv:1710.04087.
Rigouts Terryn, Ayla, Véronique Hoste, and Els Lefever (2020), In No Uncertain Terms: A Dataset
for Monolingual and Multilingual Automatic Term Extraction from Comparable Corpora, Lan-
guage Resources and Evaluation 54 (2), pp. 385–418.