Chris van der Lee, Thiago Castro Ferreira, Chris Emmery, Travis Wiltshire and Emiel Krahmer
This study discusses the effect of semi-supervised learning in combination with large-scale, Transformer-based, pretrained language models for data-to-text generation. Previous studies have found a beneficial effect of both semi-supervised learning techniques and language models on output quality. Both techniques aim to increase performance by extending the training data, which raises the question whether semi-supervised learning is still helpful when a large-scale language model is also present. This study aims to answer this question by comparing a data-to-text system supplemented only with a language model to two data-to-text systems that are additionally enriched with a data augmentation and a pseudo-labeling semi-supervised learning approach, respectively. Results show that extending the training set of a language-model-enriched data-to-text system with either data augmentation or pseudo-labeling results in higher scores on diversity metrics, although which semi-supervised learning approach is most effective differs per dataset. In terms of output quality, the pseudo-labeling approach increased text quality scores, but the data augmentation approach yielded scores similar to the system without training set extension. These results indicate that semi-supervised learning approaches can bolster output quality and diversity, even when a language model is also present. Future studies could look further into which semi-supervised learning approaches are most effective, and how a small-scale in-domain language model performs compared to a large-scale multi-domain one.
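To make the pseudo-labeling approach concrete, below is a minimal self-training sketch: a model trained on labeled data labels unlabeled examples, and confident pseudo-labels are added to the training set before retraining. The toy threshold "model", the `margin` parameter, and all function names are illustrative assumptions; they stand in for the authors' actual data-to-text system, which is not shown here.

```python
def train(labeled):
    # Toy "model": learn a decision threshold between the two classes
    # (a stand-in for training a data-to-text generator).
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (min(pos) + max(neg)) / 2

def predict(threshold, x):
    # Classify a value against the learned threshold.
    return 1 if x >= threshold else 0

def pseudo_label(labeled, unlabeled, margin=1.0):
    """Label unlabeled examples with a model trained on the labeled set,
    keep only confident predictions (far from the threshold), and retrain
    on the extended training set."""
    threshold = train(labeled)
    confident = [(x, predict(threshold, x))
                 for x in unlabeled
                 if abs(x - threshold) >= margin]
    return train(labeled + confident)

labeled = [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)]
unlabeled = [0.5, 4.5, 2.4]  # 2.4 lies near the boundary and is dropped
model = pseudo_label(labeled, unlabeled)
```

The confidence filter (`margin`) is the key design choice in self-training: without it, noisy pseudo-labels near the decision boundary would be fed back into training and could compound errors.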