A preliminary analysis of transformers for historical text processing beyond performance metrics

Ali Hürriyetoğlu and Marieke van Erp

Transformers are surprisingly successful in tackling natural language processing tasks. These approaches are constantly being improved, extended, and specialized for many tasks and scenarios. Although these techniques have undeniable strengths, they also have inherent limitations (Chernyavskiy et al. 2021). Each task and scenario combination, such as historical text processing (HTP), potentially introduces additional challenges for this paradigm. Preliminary studies of transformer variants for HTP are being reported (Manjavacas and Fonteyn 2022), but these studies have been specific to certain evaluation settings, benchmarks, performance metrics, languages, tasks, and variants of transformers. Furthermore, these studies do not provide analyses of the capabilities, errors, or limitations of the transformers. We need to know more, beyond performance metrics, about the utility and generalizability of transformers in processing text and in responding to research questions in digital humanities.

The use of performance metrics such as accuracy, precision, recall, and F1 is at the core of improving our methodologies for developing text processing tools. However, the more the validation and application contexts of the tools differ, the less reliable these scores become. The types of difference (covariate shift, concept drift), the measurement of their effect on performance scores (e.g. KL-divergence), and ways of improving performance on data from a target context (transfer learning, domain adaptation) have been the focus of many studies. But a detailed analysis of transformers for HTP beyond performance metrics has not been reported. This knowledge could help us prevent issues arising from the lack of an evaluation setting or benchmark for our target tasks and contexts, since it is even harder to create one for each task in the scope of HTP and digital humanities. Moreover, a detailed investigation will show us where to integrate approaches complementary to transformers in our automated processing pipelines.
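As a rough illustration of how a difference between two contexts can be quantified with KL-divergence, the sketch below compares smoothed unigram distributions of two token lists. The function names, the add-one smoothing, and the two toy sentences are assumptions made for this example; they are not taken from the study or from the Odeuropa Benchmark.

```python
import math
from collections import Counter


def unigram_distribution(tokens, vocabulary, alpha=1.0):
    """Additively smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in nats; both distributions must cover the same vocabulary."""
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)


# Toy, invented sentences standing in for a historical and a modern corpus.
historical = "the odour of the tallow candle filled the chamber".split()
modern = "the smell of fresh coffee filled the room".split()

vocabulary = sorted(set(historical) | set(modern))
p_historical = unigram_distribution(historical, vocabulary)
p_modern = unigram_distribution(modern, vocabulary)

shift = kl_divergence(p_historical, p_modern)
print(f"KL(historical || modern) = {shift:.4f}")
```

A value of zero would mean the two smoothed distributions are identical; larger values indicate a larger lexical gap. Such a number describes the data only and says nothing by itself about what a model gets wrong.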

We will present the results of our investigation, based on applying BERT, RoBERTa, and MacBERTh to smell-related sentence detection in English on the Odeuropa Benchmark, a collection of historical texts annotated at the token level for smell events. MacBERTh, a BERT model pre-trained from scratch on historical texts, outperforms BERT and RoBERTa by a narrow margin: the median F1-macro across five random seeds is .927 for MacBERTh and .921 for RoBERTa. The highest scores for these two models are .931 and .926, respectively. We will discuss these findings in the following respects: i) Is this slight difference in performance worth the investment? ii) Is the problem with the benchmark data or the transformers methodology? iii) What makes MacBERTh and RoBERTa better than BERT, which obtained a median F1-macro of .904 and a highest F1-macro of .917 for this task? iv) How are these scores related to the capabilities and limitations of transformers for HTP?
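For readers unfamiliar with the reporting scheme, the following minimal sketch shows how an F1-macro score can be computed for one run and how the median and highest scores can be summarized over several random seeds. The helper names, the toy labels, and the per-seed scores are invented for illustration and are not the results reported above.

```python
from statistics import median


def f1_for_class(gold, predicted, label):
    """F1 of a single class, treating `label` as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    return sum(f1_for_class(gold, predicted, lab) for lab in labels) / len(labels)


def summarize_runs(scores):
    """Median and highest score over runs with different random seeds."""
    return {"median": median(scores), "highest": max(scores)}


# Toy sentence labels: 1 = smell-related, 0 = not smell-related.
gold = [1, 1, 0, 0]
predicted = [1, 0, 0, 0]
print(f"F1-macro for one run: {macro_f1(gold, predicted):.3f}")

# Invented scores for five hypothetical seeds (not the paper's numbers).
hypothetical_scores = [0.90, 0.92, 0.91, 0.93, 0.89]
print(summarize_runs(hypothetical_scores))
```

Reporting the median alongside the highest score makes it visible when a gap between two models is of the same order as the variation between seeds of a single model.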

Chernyavskiy, A., Ilvovsky, D., & Nakov, P. (2021). Transformers: “The End of History” for Natural Language Processing? In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 677-693). Springer, Cham.

Manjavacas, E., & Fonteyn, L. (2022). Adapting vs Pre-training Language Models for Historical Languages. hal-03592137 URL: https://hal.inria.fr/hal-03592137