Quality Estimation for the Translation Industry – Data Challenges

Javad Pourmostafa Roshan Sharami, Elena Murgolo and Dimitar Shterionov

Machine Translation (MT) has become an irreplaceable part of translation industry workflows. Because translation quality has a direct impact on productivity, it is very important for human post-editors and project managers to be informed about the quality of MT output.

MT quality estimation (QE) is the task of predicting the quality of a translation without human references. For the translation industry, QE can serve as an indicator of the amount of post-editing needed and, hence, of productivity. QE can be applied at the word, sentence or document level. In the case of sentence- and document-level QE, given a source text and its MT counterpart, the task is to predict a score (typically TER) that indicates the translation quality. As with most NLP tasks nowadays, state-of-the-art QE is achieved using deep learning (DL) methods. Optimal QE performance is not only a question of model architecture and hyperparameters; it also strongly depends on the quantity and quality of the data.
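To make the TER label concrete, the sketch below computes a simplified, TER-like score: word-level edit distance between an MT hypothesis and a post-edited reference, normalised by reference length. Note this is an illustrative approximation, not the exact metric used in the project: full TER additionally counts block shifts as single edits.

```python
from typing import List


def edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all hypothesis words
    for j in range(n + 1):
        d[0][j] = j  # insert all reference words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]


def ter_like_score(hypothesis: str, reference: str) -> float:
    """Edit distance normalised by reference length (TER also allows shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(hyp, ref) / len(ref)
```

A perfect match yields 0.0; scores can exceed 1.0 when the hypothesis requires more edits than the reference has words, which is why TER is sometimes capped at 1.0 in practice.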

In a business environment, such as that of the translation industry, models or systems used in production should adhere to economic and usability criteria. QE for the translation industry should be optimised for domain- and use-case-specific data, should be efficient and should be adaptable. In a collaborative project between Orbital14 and Tilburg University, funded by Aglatech14, we develop a framework for MT quality assessment (MTQA) which strongly relies on QE. In this project we focus on a specific domain (patents, IP) and a single language pair (English-Italian).

In this work we present our approach to data collection, analysis and preprocessing prior to building QE models. We started with proprietary data provided by Aglatech14 to identify specific patterns that need to be covered by the QE tool. As the volume of this initial data was not sufficient to build robust DL models (42K source-translation-post-edit triplets, from which we compute TER scores), we added a corpus of 127K source-human-translation sentence pairs. We used Aglatech14's translation model to translate the source side and generate a pseudo corpus of triplets (source-MT-post-edit), from which we then computed TER scores. To further extend our data set we used publicly available data: a generic English-Italian corpus containing ~105M sentence pairs. Aiming at domain-specific QE, we selected only the sentence pairs most similar to the industry data, using a ranking-based data selection method built on cosine similarity to select an additional 42K sentences, namely those with the highest similarity score to the Aglatech14 data. After data selection, we translated the selected source data to create new synthetic data comprising source, machine translation and target sentences. We treat the target sentences as post-edited sentences. As in the previous case, we computed the TER score between the MT output and the target to use as labels for training our QE models.
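The ranking-based selection step above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: it uses plain bag-of-words counts as sentence representations and ranks a generic pool by cosine similarity to the centroid of the in-domain data; the function names (`bow_vector`, `select_top_k`) and the centroid choice are assumptions for the example.

```python
import math
from collections import Counter
from typing import List


def bow_vector(sentence: str) -> Counter:
    """Bag-of-words term-frequency vector (a stand-in for richer representations)."""
    return Counter(sentence.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def select_top_k(in_domain: List[str], generic_pool: List[str], k: int) -> List[str]:
    """Rank the generic pool by similarity to the in-domain centroid; keep top k."""
    centroid = Counter()
    for s in in_domain:
        centroid.update(bow_vector(s))
    ranked = sorted(generic_pool,
                    key=lambda s: cosine(bow_vector(s), centroid),
                    reverse=True)
    return ranked[:k]
```

With the in-domain side fixed (here, the Aglatech14 data), the same routine scales conceptually to selecting 42K sentences from a ~105M-pair corpus, although at that scale one would use batched vector operations rather than a Python sort over raw strings.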

We use these data to build state-of-the-art QE models and evaluate their performance against a gold standard reference set, provided by Aglatech14.