Political corpus creation through automatic speech recognition on EU debates

Hugo de Vos and Suzan Verberne

The plenary sessions of the European Parliament are a well-known source in computational linguistics. They provide semi-aligned multilingual data (Europarl corpus, Koehn 2005) and semi-aligned speech and text (VoxPopuli corpus, Wang et al. 2021). For political research, however, these data are less interesting: due to the size of the European Parliament (705 seats, 750 before Brexit), addresses are often brief and lack detail. Most of the interesting debates take place in parliamentary committees, which are groups of parliamentarians that discuss specific topics in depth. The problem with the committees is that, apart from very brief agendas and minutes, there are no written records of their meetings. This makes the meetings hard for political scientists to study, because doing so entails listening to recordings that often run for three hours.

For this reason, we conducted a series of experiments on using Automatic Speech Recognition (ASR) to transcribe the meetings of the Committee on Civil Liberties, Justice and Home Affairs (LIBE committee) for use in political research. This led to a corpus totaling 3.6 million running words. We focus on the domain adaptation of an ASR pipeline building on transformer-based wav2vec 2.0 models. We experiment with multiple acoustic models, multiple language models, and the addition of domain-specific terms as hotwords.

We find that a domain-specific acoustic model and a domain-specific language model substantially improve the ASR output, reducing the word error rate from 28.22% to 17.95%. Adding domain-specific terms as hotwords in the decoding stage does not improve ASR quality. Initial topic modelling results indicate that the corpus is useful for downstream analysis tasks. We conclude that domain adaptation leads to better ASR results for the political domain. We release the resulting corpus and our analysis pipeline for use in future research.
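
For readers less familiar with the metric, the word error rate reported above is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function names and the example sentences are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, adapted: float) -> float:
    """Fraction by which the adapted score improves on the baseline."""
    return (baseline - adapted) / baseline


# Hypothetical sentence pair: one substitution in five reference words.
example_wer = word_error_rate(
    "the committee on civil liberties",
    "the committee on civil liberty",
)

# Relative improvement implied by the two scores reported in the abstract.
improvement = relative_reduction(28.22, 17.95)
```

Applied to the two scores reported above, `relative_reduction` gives a relative reduction of roughly 36%.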

Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers (pp. 79-86).

Wang, C., Riviere, M., Lee, A., Wu, A., Talnikar, C., Haziza, D., … & Dupoux, E. (2021). VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning, and interpretation. arXiv preprint arXiv:2101.00390.