Detecting Perspectives on Population Sub-Groups in Pillarised Newspapers

Ryan Brate

The work of cultural heritage professionals in cataloguing items plays an invaluable role in offering the public a window onto our collective history. Pieces are presented as structured collections, together with crafted narrative descriptions that communicate their significance to the public. In cataloguing a collection, owners and curators explain the origins and significance of pieces. However, they may inevitably do so in terms that reflect the perspective and understanding of the specific time, cultural norms and linguistic norms they inhabit. Consequently, collection items may sometimes be described in a way that conveys a narrow (and incomplete) set of perspectives to the reader; i.e., the collection descriptions may impart biased representations of the material. For example, the Dutch Golden Age, or Gouden Eeuw, is a phrase used to refer to a period of significant acclaim attributed to the Netherlands. However, in explicitly reinforcing positive connotations (golden is good), the phrase emphasises the positives seen by some, to the detriment of highlighting the negatives which may have been experienced by others, e.g., aspects of colonial activity. Hence, there is scope for developing computational approaches that draw on external information sources to bring to the attention of heritage professionals the varying perspectives associated with the objects being described. To do this, we propose to adapt a hierarchical word, topic and persona model originally designed by Bamman et al. (2013). The simpler of the two models proposed by Bamman et al. uses only information available within unstructured text: entities are clustered into distinct characterisations according to linguistic features extracted from their contextual depictions. These features are to be extracted via pattern-matching rules applied to the dependency parses of the OCR’d texts.
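As an illustration of the feature extraction step, the following is a minimal sketch of pattern-matching rules over a dependency parse. It is not the implementation proposed here: the `Token` type, the dependency label sets, the role names (`agent`, `patient`, `attribute`) and the toy English sentences are all assumptions made for the example, and in practice the parses would come from a dependency parser run on the OCR’d Dutch text.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    """One token of a dependency parse (hypothetical minimal representation)."""
    idx: int    # position in the sentence
    lemma: str  # lemmatised word form
    dep: str    # dependency label linking this token to its head
    head: int   # index of the head token (the root points to itself)


# Assumed label sets; the actual labels depend on the parser and its tag scheme.
AGENT_DEPS = {"nsubj"}
PATIENT_DEPS = {"dobj", "nsubjpass", "iobj"}
ATTRIBUTE_DEPS = {"amod", "appos"}


def extract_features(sentence, targets):
    """Collect (role, word) features for each target lemma in one parsed sentence."""
    features = defaultdict(list)
    for tok in sentence:
        head = sentence[tok.head]
        if tok.lemma in targets and tok.dep in AGENT_DEPS:
            # The target is the subject of a verb: record the verb as an agent feature.
            features[tok.lemma].append(("agent", head.lemma))
        elif tok.lemma in targets and tok.dep in PATIENT_DEPS:
            # The target is an object of a verb: record the verb as a patient feature.
            features[tok.lemma].append(("patient", head.lemma))
        elif head.lemma in targets and tok.dep in ATTRIBUTE_DEPS:
            # The token modifies the target: record it as an attribute feature.
            features[head.lemma].append(("attribute", tok.lemma))
    return dict(features)


# Hand-written toy parses (hypothetical): "Poor workers demanded wages."
sentence_1 = [
    Token(0, "poor", "amod", 1),
    Token(1, "worker", "nsubj", 2),
    Token(2, "demand", "ROOT", 2),
    Token(3, "wage", "dobj", 2),
]
# "The factory dismissed workers."
sentence_2 = [
    Token(0, "the", "det", 1),
    Token(1, "factory", "nsubj", 2),
    Token(2, "dismiss", "ROOT", 2),
    Token(3, "worker", "dobj", 2),
]

features_1 = extract_features(sentence_1, {"worker"})
features_2 = extract_features(sentence_2, {"worker"})
print(features_1)
print(features_2)
```

Feature tuples of this kind, aggregated per people group and per source, would form the input on which entities are clustered into characterisations.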
We propose to adapt this approach to enable the comparison of depictions of people groups between sources. Differences in depiction may indicate differences of perspective regarding the people groups in question. We propose to explore the characterisation types identifiable in the OCR’d newspaper collection of the National Library of the Netherlands. The newspaper collection is a comprehensive resource with an extensive back-catalogue, offering accounts of a variety of people groups. We propose to sample a variety of publications according to the phenomenon of verzuiling, or pillarisation. Historically, there have been clear sub-groups within the Netherlands divided along socio-political lines, with each societal group having its own separate institutions: schools, political parties, media publications, etc. Accordingly, by selecting publications across Dutch media pillars, we maximise the opportunity to detect a nuanced and comprehensive set of characterisations. In addition, a range of population sub-groups capturing varying societal roles and positions in the social hierarchy will be selected as the basis for modelling. The following questions will be explored: Can we detect latent characterisation types of population sub-groups in a comprehensive sample of the newspaper collection of the National Library of the Netherlands? If so, how do the characterisations of people groups differ between pillarised media sources and time periods?